Deep learning has revolutionized computer vision, natural language processing, generative AI, and more. However, this progress has produced models with ever larger parameter counts, higher latency, and greater computational resource requirements. Neural network pruning can reduce the parameter count of a neural network by more than 90%, decreasing its storage requirements and improving its computational efficiency.
A data practitioner may face the following challenges when trying to deploy a model for inference:
- Running inference at scale over long periods can drive up costs through higher consumption of server-side CPU, GPU, RAM, and other resources.
- Some deep learning models need to run on edge devices such as IoT and smart devices. These devices have limited resources, so model optimization is a must in such cases.
What is Efficient Inferencing?
Some pertinent questions to ask before deploying a model are (a short measurement sketch follows this list):
- Is the model small?
- Is it fast?
- How many parameters does the model have?
- What is the RAM consumption during inference?
- What is the inference latency?
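These properties are easy to measure before investing in optimization. Below is a minimal sketch, assuming a PyTorch environment; the torchvision ResNet-18, input shape, and run count are placeholder choices rather than part of any particular workflow.

```python
import time

import torch
import torchvision.models as models

# Assumption: a torchvision ResNet-18 stands in for the model being audited.
model = models.resnet18()
model.eval()

# Parameter count and a rough on-disk size estimate (float32 = 4 bytes per weight).
num_params = sum(p.numel() for p in model.parameters())
size_mb = num_params * 4 / (1024 ** 2)
print(f"parameters: {num_params:,}  (~{size_mb:.1f} MB at fp32)")

# Rough CPU inference latency averaged over a few runs on a dummy input.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    model(x)  # warm-up
    start = time.perf_counter()
    for _ in range(10):
        model(x)
latency_ms = (time.perf_counter() - start) / 10 * 1000
print(f"mean latency: {latency_ms:.1f} ms per image")
```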
How to Achieve Efficient Inferencing?
Compression Techniques: These techniques compress the layers of a model. Two such techniques are:
- Pruning
- Quantization
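The rest of this article focuses on pruning. As a quick taste of quantization, the sketch below applies PyTorch's dynamic quantization to the linear layers of a placeholder model; the model itself and the int8 dtype are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Assumption: a tiny placeholder model; real models are quantized the same way.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Dynamic quantization stores Linear weights in int8 and dequantizes on the fly,
# shrinking the model and often speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 784)
print(quantized(x).shape)
```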
What is Pruning?
In simple words, pruning makes a neural network smaller by removing synapses and neurons.
Pruning in the Human Brain
Pruning also happens in the human brain. A newborn has nearly 2,500 synapses per neuron, a number that surges during the first few years of a child's growth and then, after roughly four years, starts to decrease. This is intuitive to grasp: the brain optimizes its neural networks by removing some of the connections, or synapses.
Given a neural network f(X, W), where X is the input and W is the set of parameters (or weights), pruning is a technique for finding a minimal subset W′ such that the remaining parameters of W are pruned (set to 0), while ensuring that the quality of the model stays above the desired threshold. After pruning, the network is said to be sparse, where sparsity is quantified as the fraction of parameters that were pruned relative to the original network: s = 1 - |W′| / |W|. The higher the sparsity, the fewer non-zero parameters remain in the pruned network. (Source: https://arxiv.org/abs/2106.08962)
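As a quick illustration of this definition, the snippet below zeroes small weights in a random tensor and computes the resulting sparsity; the tensor size and the magnitude threshold are arbitrary assumptions.

```python
import torch

# Assumption: a single random weight tensor stands in for the full parameter set W,
# and the magnitude threshold of 1.0 is arbitrary.
W = torch.randn(1000)

# "Prune" by zeroing every weight whose magnitude falls below the threshold.
W_pruned = torch.where(W.abs() < 1.0, torch.zeros_like(W), W)

# Sparsity s = 1 - |W'| / |W|, where |W'| counts the surviving non-zero weights.
s = 1.0 - W_pruned.count_nonzero().item() / W.numel()
print(f"sparsity: {s:.2%}")
```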
A typical workflow to construct a pruned network has the following three steps (a code sketch follows the list):
- Train a dense network until convergence
- Prune the network to remove unwanted structure
- Retrain the network
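A minimal PyTorch sketch of this train-prune-retrain loop, assuming the built-in torch.nn.utils.prune utilities, is shown below; the tiny network, random data, and 90% pruning amount are placeholders for a real model, dataset, and target sparsity.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Assumption: a tiny fully-connected network and random data stand in for the real
# model and dataset; train_one_epoch is a placeholder for your training loop.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

def train_one_epoch(model):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Step 1: train the dense network (until convergence, in practice).
train_one_epoch(model)

# Step 2: prune, e.g. remove 90% of each Linear layer's weights by magnitude.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)

# Step 3: retrain (fine-tune); the pruning masks keep the removed weights at zero.
train_one_epoch(model)
```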
Lottery Ticket Hypothesis: The idea of a sparse structure hiding inside a dense model is inspired by the lottery ticket hypothesis, which states that:
“A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations“
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
How to Prune?
What synapses and neurons should we prune?
When removing parameters from a neural network, the less important the removed parameters are, the better the pruned network performs.
If only some weights have to be removed, which ones? And why?
Magnitude-based Pruning: Magnitude-based pruning considers weights with larger absolute values to be more important than other weights. For element-wise pruning, Importance = |W|.
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
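This rule can be sketched in a few lines: compute Importance = |W| element-wise and zero everything below a chosen percentile. The layer shape and the 60% prune ratio below are arbitrary illustrations.

```python
import torch

# Assumption: one random weight matrix stands in for a layer, and the 60% prune
# ratio is an arbitrary target.
W = torch.randn(256, 128)
prune_ratio = 0.6

# Element-wise importance is simply the absolute value of each weight.
importance = W.abs()

# Zero out everything below the threshold chosen so that `prune_ratio` of the
# entries (the smallest magnitudes) are removed.
k = int(prune_ratio * W.numel())
threshold = importance.flatten().kthvalue(k).values
mask = (importance > threshold).float()
W_pruned = W * mask

print(f"kept {int(mask.sum())} of {W.numel()} weights")
```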
N:M Sparsity in the A100 via Pruning
The NVIDIA A100 GPU adds support for fine-grained structured sparsity to its Tensor Cores. Sparse Tensor Cores accelerate a 2:4 sparsity pattern: in each contiguous block of four values, two values must be zero. This naturally leads to a sparsity of 50%, which is fine-grained; no whole vectors or blocks are pruned together. Such a regular pattern is easy to compress and has low metadata overhead.
The routine for training a pruned network that follows an N:M structured sparsity pattern is (see the sketch after this list):
- Start with a dense network
- On the dense network, prune the weights to satisfy the 2:4 structured sparsity criteria. Out of every four elements, remove just two.
- Repeat the original training procedure.
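A rough sketch of this routine in plain PyTorch is shown below. The tiny linear layer, random data, and step counts stand in for a real network and its original training procedure, and the hand-rolled mask is only meant to illustrate the 2:4 pattern; in practice NVIDIA's tooling automates this step.

```python
import torch
import torch.nn as nn

def two_four_mask(weight):
    # In every contiguous group of four weights, keep the two largest magnitudes
    # and zero the other two: the 2:4 pattern that Sparse Tensor Cores expect.
    blocks = weight.reshape(-1, 4)
    keep = blocks.abs().topk(k=2, dim=1).indices
    return torch.zeros_like(blocks).scatter_(1, keep, 1.0).reshape(weight.shape)

def train(model, steps=100, mask=None):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(steps):
        x, y = torch.randn(64, 128), torch.randint(0, 10, (64,))
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if mask is not None:  # keep the pruned weights at zero while retraining
            with torch.no_grad():
                model.weight.mul_(mask)

# Assumption: a single Linear layer and random data are placeholders for the real
# network and its original training procedure.
model = nn.Linear(128, 10)
train(model)                              # 1. train the dense network
mask = two_four_mask(model.weight.data)   # 2. prune to the 2:4 pattern
with torch.no_grad():
    model.weight.mul_(mask)
train(model, mask=mask)                   # 3. repeat the original training procedure
```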
As you might expect, setting half of a network's weights to zero can affect its accuracy. The third step recovers that accuracy by running enough weight-update steps for the weights to converge, with a high enough learning rate to let the weights move around sufficiently.
Performance in TensorRT 8.0
Along with applying techniques for efficient deep learning, you can optimize cost by choosing an appropriate cloud GPU platform. E2E Cloud provides a range of GPUs for all kinds of deep learning and graphics workloads at the most affordable prices in the market. Try our platform and Cloud GPUs with a free trial. To get your free credits, contact: sales@e2enetworks.com