Deep learning has revolutionized computer vision, natural language processing, generative AI, and more. However, this progress has produced models with ever larger parameter counts, higher latency, and greater computational resource requirements. Neural network pruning can reduce the parameter count of a neural network by more than 90%, decreasing its storage requirements and improving its computational efficiency.
A data practitioner may face the following challenges when trying to deploy a model for inference:
- Running inference at scale over long periods can drive up costs through higher consumption of server-side CPU, GPU, RAM, and other resources.
- Some deep learning models need to run on edge devices such as IoT and smart devices. These devices have limited resources, so model optimization is a must in such cases.
What is Efficient Inferencing?
Some pertinent questions to ask before deploying a model are (a short measurement sketch follows this list):
- Is the model small?
- Is it fast?
- How many parameters does the model have?
- What is the RAM consumption during inference?
- What is the inference latency?
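These properties are easy to measure before investing in optimization. Below is a minimal sketch, assuming a PyTorch environment; the torchvision ResNet-18, input shape, and run count are placeholder choices rather than part of any particular workflow.

```python
import time

import torch
import torchvision.models as models

# Assumption: a torchvision ResNet-18 stands in for the model being audited.
model = models.resnet18()
model.eval()

# Parameter count and a rough on-disk size estimate (float32 = 4 bytes per weight).
num_params = sum(p.numel() for p in model.parameters())
size_mb = num_params * 4 / (1024 ** 2)
print(f"parameters: {num_params:,}  (~{size_mb:.1f} MB at fp32)")

# Rough CPU inference latency averaged over a few runs on a dummy input.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    model(x)  # warm-up
    start = time.perf_counter()
    for _ in range(10):
        model(x)
latency_ms = (time.perf_counter() - start) / 10 * 1000
print(f"mean latency: {latency_ms:.1f} ms per image")
```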
How to Achieve Efficient Inferencing?
Compression Techniques: These techniques compress the layers of a model. Two such techniques are:
- Pruning
- Quantization
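The rest of this article focuses on pruning. As a quick taste of quantization, the sketch below applies PyTorch's dynamic quantization to the linear layers of a placeholder model; the model itself and the int8 dtype are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Assumption: a tiny placeholder model; real models are quantized the same way.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Dynamic quantization stores Linear weights in int8 and dequantizes on the fly,
# shrinking the model and often speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 784)
print(quantized(x).shape)
```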
What is Pruning?
In simple words, pruning makes a neural network smaller by removing synapses and neurons.
Pruning in the Human Brain
Pruning also happens in the human brain. A newborn has nearly 2,500 synapses per neuron, a number that surges during the first few years of a child's growth and then, after roughly four years, starts to decrease. This is intuitive to grasp: the brain optimizes its neural networks by removing some of the connections, or synapses.
Given a neural network f(X, W), where X is the input and W is the set of parameters (or weights), pruning is a technique for finding a minimal subset W′ such that the remaining parameters of W are pruned (set to 0), while ensuring that the quality of the model stays above the desired threshold. After pruning, the network is said to be sparse, where sparsity is quantified as the fraction of parameters that were pruned relative to the original network: s = 1 - |W′| / |W|. The higher the sparsity, the fewer non-zero parameters remain in the pruned network. (Source: https://arxiv.org/abs/2106.08962)
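As a quick illustration of this definition, the snippet below zeroes small weights in a random tensor and computes the resulting sparsity; the tensor size and the magnitude threshold are arbitrary assumptions.

```python
import torch

# Assumption: a single random weight tensor stands in for the full parameter set W,
# and the magnitude threshold of 1.0 is arbitrary.
W = torch.randn(1000)

# "Prune" by zeroing every weight whose magnitude falls below the threshold.
W_pruned = torch.where(W.abs() < 1.0, torch.zeros_like(W), W)

# Sparsity s = 1 - |W'| / |W|, where |W'| counts the surviving non-zero weights.
s = 1.0 - W_pruned.count_nonzero().item() / W.numel()
print(f"sparsity: {s:.2%}")
```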
A typical workflow to construct a pruned network has the following three steps (a code sketch follows the list):
- Train a dense network until convergence
- Prune the network to remove unwanted structure
- Retrain the network
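A minimal PyTorch sketch of this train-prune-retrain loop, assuming the built-in torch.nn.utils.prune utilities, is shown below; the tiny network, random data, and 90% pruning amount are placeholders for a real model, dataset, and target sparsity.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Assumption: a tiny fully-connected network and random data stand in for the real
# model and dataset; train_one_epoch is a placeholder for your training loop.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

def train_one_epoch(model):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Step 1: train the dense network (until convergence, in practice).
train_one_epoch(model)

# Step 2: prune, e.g. remove 90% of each Linear layer's weights by magnitude.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)

# Step 3: retrain (fine-tune); the pruning masks keep the removed weights at zero.
train_one_epoch(model)
```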
Lottery Ticket Hypothesis: The idea of a sparse structure hiding inside a dense model is inspired by the lottery ticket hypothesis, which states that:
“A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations“
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
How to Prune?
What synapses and neurons should we prune?
When removing parameters from a neural network, the less important the removed parameters are, the better the pruned network performs.
If only some weights have to be removed, which ones? And why?
Magnitude-based Pruning: Magnitude-based pruning considers weights with larger absolute values to be more important than other weights. For element-wise pruning, Importance = |W|.
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
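This rule can be sketched in a few lines: compute Importance = |W| element-wise and zero everything below a chosen percentile. The layer shape and the 60% prune ratio below are arbitrary illustrations.

```python
import torch

# Assumption: one random weight matrix stands in for a layer, and the 60% prune
# ratio is an arbitrary target.
W = torch.randn(256, 128)
prune_ratio = 0.6

# Element-wise importance is simply the absolute value of each weight.
importance = W.abs()

# Zero out everything below the threshold chosen so that `prune_ratio` of the
# entries (the smallest magnitudes) are removed.
k = int(prune_ratio * W.numel())
threshold = importance.flatten().kthvalue(k).values
mask = (importance > threshold).float()
W_pruned = W * mask

print(f"kept {int(mask.sum())} of {W.numel()} weights")
```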
N:M Sparsity in the A100 via Pruning
The NVIDIA A100 GPU adds support for fine-grained structured sparsity to its Tensor Cores. Sparse Tensor Cores accelerate a 2:4 sparsity pattern: in each contiguous block of four values, two values must be zero. This naturally leads to a sparsity of 50%, which is fine-grained; no whole vectors or blocks are pruned together. Such a regular pattern is easy to compress and has low metadata overhead.
The routine for training a pruned network that follows an N:M structured sparsity pattern is (see the sketch after this list):
- Start with a dense network
- On the dense network, prune the weights to satisfy the 2:4 structured sparsity criteria. Out of every four elements, remove just two.
- Repeat the original training procedure.
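A rough sketch of this routine in plain PyTorch is shown below. The tiny linear layer, random data, and step counts stand in for a real network and its original training procedure, and the hand-rolled mask is only meant to illustrate the 2:4 pattern; in practice NVIDIA's tooling automates this step.

```python
import torch
import torch.nn as nn

def two_four_mask(weight):
    # In every contiguous group of four weights, keep the two largest magnitudes
    # and zero the other two: the 2:4 pattern that Sparse Tensor Cores expect.
    blocks = weight.reshape(-1, 4)
    keep = blocks.abs().topk(k=2, dim=1).indices
    return torch.zeros_like(blocks).scatter_(1, keep, 1.0).reshape(weight.shape)

def train(model, steps=100, mask=None):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(steps):
        x, y = torch.randn(64, 128), torch.randint(0, 10, (64,))
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if mask is not None:  # keep the pruned weights at zero while retraining
            with torch.no_grad():
                model.weight.mul_(mask)

# Assumption: a single Linear layer and random data are placeholders for the real
# network and its original training procedure.
model = nn.Linear(128, 10)
train(model)                              # 1. train the dense network
mask = two_four_mask(model.weight.data)   # 2. prune to the 2:4 pattern
with torch.no_grad():
    model.weight.mul_(mask)
train(model, mask=mask)                   # 3. repeat the original training procedure
```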
As you might expect, setting half of a network's weights to zero can affect its accuracy. The third step recovers that accuracy by running enough weight-update steps for the weights to converge, with a high enough learning rate to let the weights move around sufficiently.
Performance in TensorRT 8.0
Along with applying techniques for efficient deep learning, you can optimize cost by choosing an appropriate cloud GPU platform. E2E Cloud provides a range of GPUs for all kinds of deep learning and graphics workloads at the most affordable prices in the market. Try our platform and Cloud GPUs with a free trial. To get your free credits, contact: sales@e2enetworks.com