Introduction
Transcription services are utilized in various industries to convert audio or video content into text. Some of the industries that benefit from transcription services include:
Healthcare and Medical Professionals
Medical transcription plays a crucial role in the healthcare industry, converting physicians’ recordings into accurate text and helping maintain medical records.
Legal/Law Industry
Law firms, paralegals, court reporters, and attorneys use transcription services for legal purposes, such as transcribing depositions and court hearings.
Businesses
Businesses use transcription services to transcribe board meetings, conferences, interviews, and other events into error-free transcripts for better decision-making and future reference.
Media and Mass Communication
Media professionals, including journalists, video producers, filmmakers, and copywriters, use transcription services to transcribe interviews and other content for articles, press releases, and captions.
Digital Marketing
Digital marketers and content strategists use transcriptionists to convert podcasts, webinars, and other materials into text for blog posts and content creation.
Other Industries
Other professionals who benefit from transcription services include market researchers, video and audio podcasters, freelance writers, authors, and keynote speakers.
As we can see, there are quite a few domains where transcription technology is used. Solutions that offer fast transcription speeds are therefore in high demand. In this article, we’ll look at OpenAI’s Whisper Large V3 for fast transcription, and we’ll go through a step-by-step process to rev up transcription speed to insanely fast levels by modifying various parameters in our transformers pipeline.
Starting on E2E Cloud
GPU Requirements
Since we will be incorporating Flash Attention (an algorithm designed to speed up the transformers pipeline), we’ll be working with A100 GPUs because they are compatible with Flash Attention. We’ll be using the cloud computing services offered by E2E for this purpose. Let us create a GPU node for NVIDIA A100 GPUs on the E2E platform.
Make sure you select the second option with CUDA-11 enabled, as we’ll need that for our experiments.
Download the Required Libraries
Then let’s load the autotime extension in our Jupyter notebook to keep track of the runtimes of our cells.
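A minimal sketch of this step, assuming the ipython-autotime package (which provides the autotime extension) still needs to be installed:

!pip install ipython-autotime
%load_ext autotime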
Next, import the required libraries.
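The exact import cell isn’t reproduced here; a typical set of imports for the cases that follow might look like this:

import torch
from transformers import pipeline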
Download the Test Audio File
For this article we will be using the audio of the podcast between Sam Altman and Lex Fridman. The audio is 2.5 hours long. It can be downloaded from the Hugging Face dataset we’ve created by using the following command.
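The dataset repository and file name aren’t reproduced here, so the command below is only a hypothetical placeholder; substitute the actual Hugging Face dataset path:

# Hypothetical repository and file names, shown for illustration only
!wget https://huggingface.co/datasets/<your-dataset-repo>/resolve/main/<podcast-audio-file>.mp3 -O podcast.mp3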
Case 1: Using the Base Transformer Model
For the first case, we’ll use the base transformers pipeline without any adjustments to its parameters and take a look at the transcription time.
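As a rough sketch of what this baseline might look like (the podcast.mp3 file name and the chunk_length_s value are assumptions, not reproduced from the original cell):

from transformers import pipeline

# Base Whisper Large V3 pipeline with default precision and no batching
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device="cuda:0",
)

# Whisper processes long audio in 30-second chunks
outputs = pipe("podcast.mp3", chunk_length_s=30)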
So this took 19 min 46 sec to transcribe the entire audio.
Outputting the first 200 characters for a quality check:
outputs["text"][:200]
Case 2: Batching
Batching in transformers refers to the process of grouping multiple input sequences together into a single batch for processing by the model. This is done to improve the efficiency of the model during training and inference.
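With the pipeline from Case 1, batching only requires passing a batch_size to the call; the value of 8 below is an assumption and can be tuned to your GPU memory:

# Transcribe 8 chunks per forward pass instead of one at a time
outputs = pipe("podcast.mp3", chunk_length_s=30, batch_size=8)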
With this method, the transcription time goes down to 5 min 32 sec, which is about 28% of our original time.
outputs["text"][:200]
Case 3: Half Precision
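Since the explanation below refers to the code, here is a sketch of what the half-precision setup might look like; torch_dtype is the relevant change, and the remaining parameters are carried over from the earlier cases as assumptions:

import torch
from transformers import pipeline

# Load the model weights in FP16 instead of the default FP32
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)

outputs = pipe("podcast.mp3", chunk_length_s=30, batch_size=8)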
In the code above, torch.float16 refers to the half-precision floating-point format, also known as FP16. Half precision reduces the memory usage of the neural network, allowing larger networks to be trained and deployed, and transfers data in less time than higher-precision formats such as FP32 or FP64.
outputs["text"][:200]
Case 4: BetterTransformer
When a model is converted to a BetterTransformer using to_bettertransformer(), it benefits from PyTorch’s accelerated implementation of the attention mechanism, leading to faster inference and improved memory efficiency.
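A sketch of the conversion, assuming the optimum library is installed (to_bettertransformer() relies on it) and reusing the FP16 pipeline from Case 3:

# Swap the model's attention for PyTorch's accelerated implementation
pipe.model = pipe.model.to_bettertransformer()

outputs = pipe("podcast.mp3", chunk_length_s=30, batch_size=8)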
Case 5: BetterTransformer + Higher Batch Size
A larger batch size can provide computational efficiency by parallelizing operations across examples. This can speed up both training and inference, since more computations are performed simultaneously.
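Building on the previous case, only the batch size changes; the value of 24 below is an assumption chosen to illustrate the idea:

# Same BetterTransformer pipeline, just with a larger batch size
outputs = pipe("podcast.mp3", chunk_length_s=30, batch_size=24)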
Case 6: Flash Attention
Flash Attention is an attention algorithm used to reduce the memory bottleneck and scale transformer-based models more efficiently, enabling faster training and inference. It leverages classical techniques such as tiling and recomputation to achieve a remarkable boost in speed and a substantial reduction in memory usage.
First install the following libraries:
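The exact install cell isn’t reproduced here; a typical way to install the Flash Attention package is:

!pip install ninja
!pip install flash-attn --no-build-isolation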
You will have to restart the kernel before you can use Flash Attention. So please do that.
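After restarting, here is a sketch of enabling Flash Attention in the pipeline; recent transformers versions accept attn_implementation through model_kwargs (older versions used a use_flash_attention_2 flag instead), and the other parameters are again assumptions carried over from the earlier cases:

import torch
from transformers import pipeline

# Ask transformers to load the model with Flash Attention 2 kernels
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

outputs = pipe("podcast.mp3", chunk_length_s=30, batch_size=24)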
outputs["text"][:200]
Conclusion
Comparing the transcription times of the different methods, there is a progressive decrease: the base transformers pipeline took 19 min 46 sec, batching alone brought that down to 5 min 32 sec, and with all of the optimizations applied, including Flash Attention, we reached the insanely fast transcription time of 2 min 2 sec.
References
https://github.com/Vaibhavs10/insanely-fast-whisper/blob/main/notebooks/infer_transformers_whisper_large_v2.ipynb