Insanely Fast Text Transcription from Audio or Video Content Using Whisper Large V3

April 2, 2025

Introduction

Transcription services are utilized in various industries to convert audio or video content into text. Some of the industries that benefit from transcription services include:

Healthcare and Medical Professionals

Medical transcription plays a crucial role in the healthcare industry, converting physicians’ recordings into accurate text so that medical records can be maintained.

Legal/Law Industry

Law firms, paralegals, court reporters, and attorneys use transcription services for legal purposes, such as transcribing depositions and court hearings.

Businesses

Businesses use transcription services to transcribe board meetings, conferences, interviews, and other events into error-free transcripts for better decision-making and future reference.

Media and Mass Communication

Media professionals, including journalists, video producers, filmmakers, and copywriters, use transcription services to transcribe interviews and other content for articles, press releases, and captions.

Digital Marketing

Digital marketers and content strategists use transcriptionists to convert podcasts, webinars, and other materials into text for blog posts and content creation.

Other Industries

Other industries that benefit from transcription services include market researchers, video and audio podcasters, freelance writers, authors, and keynote speakers.

As we can see, there are quite a few domains where transcription technology is used, so solutions that offer fast transcription speeds are in high demand. In this article we’ll look at OpenAI’s Whisper Large V3 for fast transcription, and we’ll go through a step-by-step process to rev up transcription speed to insanely fast levels by modifying various parameters in our transformers pipeline.

Starting on E2E Cloud

GPU Requirements

Since we will be incorporating Flash Attention (an algorithm designed to speed up the transformers pipeline), we’ll be working with A100 GPUs, which are compatible with Flash Attention. We’ll be using the cloud computing services offered by E2E for this purpose, so let’s create an NVIDIA A100 GPU node on the E2E platform.

Make sure you select the second option with CUDA-11 enabled, as we’ll need that for our experiments.
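
Once the node is up, an optional sanity check from the notebook confirms that the A100 is visible and shows the CUDA version reported by the driver:

!nvidia-smi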

Download the Required Libraries


!pip install -q --upgrade torch torchvision torchaudio
!pip install -q git+https://github.com/huggingface/transformers
!pip install -q accelerate optimum
!pip install -q ipython-autotime


!sudo apt install -y ffmpeg

Then let’s load the autotime extension in our Jupyter notebook to keep track of the runtimes of our cells.


%load_ext autotime

Next, import the required libraries.


import torch
from transformers import pipeline

Download the Test Audio File

For this article we will be using the audio of the podcast between Sam Altman and Lex Fridman. The audio is 2.5 hours long. It can be downloaded from the Hugging Face dataset we’ve created by using the following command. 


!wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/sam_altman_lex_podcast_367.flac

Case 1: Using the Base Transformer Model

For the first case, we’ll simply use the base transformer pipeline without adjusting any parameters and then take a look at the transcription time.


pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v2",
                device="cuda:0")



outputs = pipe("sam_altman_lex_podcast_367.flac",
               chunk_length_s=30)

So this took 19 min 46 sec to transcribe the entire audio.

Outputting the first 200 characters for a quality check:

outputs["text"][:200]

Case 2: Batching

Batching in transformers refers to the process of grouping multiple input sequences together into a single batch for processing by the model. This is done to improve the efficiency of the model during training and inference.


outputs = pipe("sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               batch_size=8)
               

With this method, the transcription time goes down to 5 min 32 sec, which is about 28% of our original time.

outputs["text"][:200]

Case 3: Half Precision


pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v2",
                torch_dtype=torch.float16,
                device="cuda:0")

In the code above, torch.float16 refers to the half-precision floating-point format, also known as FP16. Half precision reduces the memory usage of the neural network, allowing larger networks to be trained and deployed, and transfers data in less time than higher-precision formats such as FP32 or FP64.


outputs = pipe("sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               batch_size=16,
               return_timestamps=True)
               

outputs["text"][:200] 

Case 4: BetterTransformer

When a model is converted to a BetterTransformer using to_bettertransformer(), it benefits from the accelerated implementation of the attention mechanism of Transformers, leading to faster inference and improved memory efficiency.


pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v2",
                torch_dtype=torch.float16,
                device="cuda:0")

pipe.model = pipe.model.to_bettertransformer()


outputs = pipe("sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               batch_size=16,
               return_timestamps=True)
               

Case 5: BetterTransformer + Higher Batch Size

A larger batch size can provide computational efficiency by parallelizing operations across examples. Since more computations are performed simultaneously, this speeds up inference here just as it does training.
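
No new pipeline is needed here; reusing the BetterTransformer pipeline from Case 4, a minimal sketch of the same call with a larger batch size (batch_size=24 is an assumption; tune it to your GPU memory) looks like this:

outputs = pipe("sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               batch_size=24,  # larger batch than Case 4; adjust to fit GPU memory
               return_timestamps=True)

outputs["text"][:200]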

Case 6: Flash Attention     

Flash Attention is an attention algorithm used to reduce the memory bottleneck and scale transformer-based models more efficiently, enabling faster training and inference. It leverages classical techniques such as tiling and recomputation to achieve a remarkable boost in speed and a substantial reduction in memory usage.

First install the following libraries:


!pip uninstall -y ninja && pip install ninja
!pip install wheel setuptools
!pip install flash-attn --no-build-isolation

You will have to restart the kernel before Flash Attention can be used, so please do that now.
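
After restarting, an optional sanity check (assuming the package exposes __version__, which recent flash-attn releases do) confirms the installation before rebuilding the pipeline:

import flash_attn
print(flash_attn.__version__)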


pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v2",
                torch_dtype=torch.float16,
                model_kwargs={"use_flash_attention_2": True},
                device="cuda:0")



outputs = pipe("sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               batch_size=24,
               return_timestamps=True)

outputs["text"][:200]

Conclusion

Comparing the transcription times of the different methods, there is a progressive decrease: from 19 min 46 sec with the base pipeline, to 5 min 32 sec with batching, all the way down to the insanely fast transcription time of 2 min 2 sec with Flash Attention.

References

https://github.com/Vaibhavs10/insanely-fast-whisper/blob/main/notebooks/infer_transformers_whisper_large_v2.ipynb
