Open-weight ASR has reached a point where the model choice is only half the decision. The other half is configuration — and most teams get it wrong by default.
Batch size, precision, attention implementation, chunk length, decoding strategy: each of these parameters affects throughput, accuracy, and memory usage in ways that are not obvious and not consistent across models. What works for Whisper does not apply to Parakeet. What is optimal on an A100 is not optimal on an L4.
We wanted concrete answers for the hardware we offer. So we took the three most widely used open ASR models, spun up an NVIDIA L4 (23 GB) instance on E2E Cloud, and ran 58 configurations — baseline through every meaningful software optimization — measuring WER, throughput, latency, VRAM, and power draw for each one.
This post is what we found.
Key Takeaways
- Parakeet bf16 batch=8 hits 238× real-time throughput — one hour of audio in 15 seconds
- Nemotron hits 258× real-time with WER that never moves across any fp16 configuration
- SDPA attention gives Whisper a 1.8× free throughput improvement — most deployments aren't using it
- Beam search on Parakeet is counterproductive: 2× slower with no reliable accuracy gain
- Whisper chunk=10s causes a 3.5% absolute WER regression — a hidden accuracy trap
- All three models fit on 23 GB with significant headroom
Get ₹2,000 free credits to test your AI workloads
Sign up and complete ID verification to unlock free credits. Deploy on NVIDIA H200, H100, and L40S GPUs—no commitment required.
The Problem with Default ASR Configurations
Every ASR library ships with defaults. Those defaults are reasonable starting points, but they are not production configurations.
The parameters that actually move the needle are not obvious:
- Batch size — the single most impactful lever, but the optimal value is hardware-specific and model-specific. What works on an A100 does not work the same way on an L4.
- Attention implementation — sdpa vs eager in Transformers. One line of code. 1.8× throughput difference on Whisper.
- Precision — fp16 vs bf16 vs fp32. The accuracy impact varies by model and is not always what you expect.
- Chunk length — for Whisper on long audio. Reducing it for lower latency causes a silent accuracy regression most benchmarks never report.
- Decoding strategy — beam search vs greedy. On encoder-decoder models like Whisper, the trade-off is real. On transducer models like Parakeet, beam search is a liability.
- Streaming frame length — for Nemotron. Turns out this is a pure latency knob with zero accuracy effect.
These parameters interact. A larger batch size with eager attention is slower than a smaller batch size with SDPA. You need to test combinations, not individual knobs in isolation.
That is what this benchmark does.
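To make that concrete, here is a minimal sketch of the kind of sweep we mean (the names are illustrative, not our actual harness): configurations are enumerated as combinations and measured together, not tuned one knob at a time.

```python
# Illustrative sweep skeleton. run_config() is a hypothetical function that
# would load the model with these settings and measure WER, RTF, VRAM, and
# power on a fixed evaluation set.
from itertools import product

batch_sizes = [1, 4, 8, 16, 32]
precisions = ["fp16", "bf16"]
attn_impls = ["sdpa", "eager"]

results = []
for bs, precision, attn in product(batch_sizes, precisions, attn_impls):
    config = {"batch_size": bs, "precision": precision, "attention": attn}
    print(config)
    # results.append(run_config(config))  # hypothetical benchmark call
```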
1. The Three Models
| Model | Parameters | Architecture | Family |
|---|---|---|---|
| openai/whisper-large-v3-turbo | 809 M | Encoder-Decoder Transformer | OpenAI Whisper |
| nvidia/parakeet-tdt-0.6b-v3 | 600 M | FastConformer + TDT Decoder | NVIDIA NeMo |
| nvidia/nemotron-speech-streaming-en-0.6b | 600 M | Cache-Aware FastConformer + RNNT | NVIDIA NeMo |
Whisper-large-v3-turbo is OpenAI's distilled variant of large-v3 — faster, with a marginal accuracy trade-off. It is the most robust of the three on noisy and accented audio, reflecting its training on hundreds of thousands of hours of diverse web audio.
Parakeet-TDT-0.6b-v3 uses NVIDIA's Token-and-Duration Transducer decoder on top of a FastConformer encoder. It is designed specifically for high-throughput offline transcription and its numbers reflect that.
Nemotron-Speech-Streaming-en-0.6b is architecturally different from the other two. Its Cache-Aware FastConformer encodes each audio frame exactly once and reuses the cached state across steps. This means chunk size does not affect how much acoustic context the model sees — only when it sees it. The practical implication is significant and shows up directly in the benchmark results.
2. Test Setup
GPU: NVIDIA L4 — 23,034 MiB VRAM
CUDA: 12.4
PyTorch: 2.6.0
Stack: HuggingFace Transformers 4.43+ | NVIDIA NeMo 2.7.0
Dataset: LibriSpeech test-clean (50 samples) + test-other (50 samples)
Metrics Collected Per Config
| Metric | Definition |
|---|---|
| WER | Word Error Rate vs LibriSpeech ground truth |
| CER | Character Error Rate |
| RTF | inference_time / audio_duration — lower is better |
| Throughput | Real-time multiplier (1 / RTF) — higher is better |
| p50 Latency | Median per-utterance inference time |
| Peak VRAM | Maximum GPU memory (nvidia-smi polling) |
| Avg Power | Mean watts during inference run |
| Energy / hr audio | Wh consumed per hour of audio processed |
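One relationship worth keeping in mind when reading the tables: the energy column follows from the other two measurements. Watt-hours per hour of audio is average power draw multiplied by RTF; this is our reading of the metric, and it is consistent with the reported numbers.

```python
# Sanity check: Wh per hour of audio = average power (W) x RTF.
# Values are taken from the result tables below (the Whisper baseline and the
# best Parakeet config).
rtf, avg_power_w = 0.0292, 68.6
print(round(avg_power_w * rtf, 2))   # ~2.00 Wh/hrA, matching the Whisper baseline row

rtf, avg_power_w = 0.0042, 50.9
print(round(avg_power_w * rtf, 2))   # ~0.21 Wh/hrA, matching Parakeet bf16 batch=8
```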
Configurations Tested
| Dimension | Whisper | Parakeet | Nemotron |
|---|---|---|---|
| Batch size | 1, 4, 8, 16, 32 | 1, 4, 8, 16, 32, 64 | 1, 2, 4, 8, 16 |
| Precision | fp16, bf16, int8 | fp16, bf16, fp32 | fp16, bf16 |
| Attention | sdpa, eager | — | — |
| Beam size | 1, 4, 8 | 1, 4 | — |
| Chunk / frame length | 10s, 20s, 30s | — | 0.1s, 0.5s, 1s, 2s, 4s |
| condition_on_prev | True, False | — | — |
| return_timestamps | True, False | — | — |
| Total configs | 20 | 20 | 18 |
Note on WER for NeMo models: Parakeet and Nemotron output punctuated text; LibriSpeech references are plain text. NeMo WER is inflated by approximately 2–3% absolute vs post-normalised scores. Comparisons within each model are fully valid; cross-model WER comparison should account for this.
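For reference, a minimal sketch of the normalisation step that closes most of that gap: lower-case and strip punctuation on both sides before scoring with a WER library such as jiwer. This is a generic illustration, not the exact scoring script used for the tables below.

```python
# Minimal WER normalisation sketch (assumption: lower-casing and stripping
# punctuation is enough for LibriSpeech-style plain-text references).
import re
import jiwer  # pip install jiwer

def normalise(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)      # drop punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()

reference = "he hoped there would be stew for dinner"
hypothesis = "He hoped there would be stew for dinner."  # punctuated NeMo-style output

print(jiwer.wer(reference, hypothesis))                        # 0.25: penalised for case and punctuation
print(jiwer.wer(normalise(reference), normalise(hypothesis)))  # 0.0 after normalisation
```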
3. Whisper Results and Findings
Full Results
| Config | WER | RTF | Throughput | p50 Lat | VRAM | Power | Energy |
|---|---|---|---|---|---|---|---|
| w01 — baseline (bs=1, fp16, sdpa, chunk=30) | 8.93% | 0.0292 | 34.2× | 0.198s | 2,093 MB | 68.6 W | 2.00 Wh/hrA |
| w02 — accurate real-time (bs=1, beam=4) | 8.38% | 0.0370 | 27.1× | 0.244s | 2,113 MB | 69.1 W | 2.56 Wh/hrA |
| w03 — batch-4 greedy (fp16, sdpa) | 8.93% | 0.0248 | 40.4× | 0.176s | 2,291 MB | 70.3 W | 1.74 Wh/hrA |
| w04 — batch-8 greedy | 8.98% | 0.0272 | 36.7× | 0.194s | 2,719 MB | 68.1 W | 1.85 Wh/hrA |
| w05 — batch-16 greedy | 8.93% | 0.0270 | 37.1× | 0.193s | 3,195 MB | 66.1 W | 1.78 Wh/hrA |
| w06 — batch-32 greedy | 8.93% | 0.0277 | 36.1× | 0.201s | 3,945 MB | 65.4 W | 1.81 Wh/hrA |
| w07 — max accuracy (bs=1, beam=8) | 8.43% | 0.0471 | 21.2× | 0.311s | 2,303 MB | 69.7 W | 3.28 Wh/hrA |
| w08 — INT8 cost-optimised (bs=1) | 8.43% | 0.0638 | 15.7× | 0.423s | 1,207 MB | 50.9 W | 3.25 Wh/hrA |
| w09 — INT8 batch-8 | 8.76% | 0.0420 | 23.8× | 0.304s | 1,867 MB | 61.6 W | 2.59 Wh/hrA |
| w10 — bf16 batch-4 | 8.93% | 0.0243 | 41.1× | 0.173s | 2,299 MB | 64.9 W | 1.58 Wh/hrA |
| w11 — bf16 batch-8 | 8.98% | 0.0266 | 37.6× | 0.191s | 2,727 MB | 65.2 W | 1.74 Wh/hrA |
| w12 — eager attn (bs=4, fp16) | 8.93% | 0.0449 | 22.3× | 0.324s | 3,589 MB | 69.3 W | 3.11 Wh/hrA |
| w13 — chunk-10 (bs=4) | 12.43% ⚠️ | 0.0328 | 30.5× | 0.234s | 2,373 MB | 67.0 W | 2.20 Wh/hrA |
| w14 — chunk-20 (bs=4) | 8.60% | 0.0266 | 37.7× | 0.186s | 2,297 MB | 68.4 W | 1.82 Wh/hrA |
| w15 — batch-8 beam-4 balanced | 8.38% | 0.0388 | 25.8× | 0.276s | 3,873 MB | 69.5 W | 2.70 Wh/hrA |
| w16 — timestamps (bs=4) | 8.21% | 0.0268 | 37.3× | 0.193s | 2,299 MB | 71.5 W | 1.92 Wh/hrA |
| w17 — max throughput bf16 (bs=32) | 8.87% | 0.0269 | 37.2× | 0.198s | 3,953 MB | 66.2 W | 1.78 Wh/hrA |
| w18 — condition_on_prev (bs=4) | 8.93% | 0.0257 | 38.9× | 0.186s | 2,299 MB | 72.3 W | 1.86 Wh/hrA |
| w19 — bf16 accurate (bs=1, beam=4) | 8.11% | 0.0371 | 26.9× | 0.248s | 2,121 MB | 71.1 W | 2.64 Wh/hrA |
| w20 — batch-16 beam-4 | 8.38% | 0.0402 | 24.9× | 0.292s | 5,411 MB | 67.5 W | 2.72 Wh/hrA |
Finding 1: SDPA vs Eager — 1.8× throughput, zero accuracy cost
Comparing w03 (SDPA, batch=4, fp16) and w12 (eager, batch=4, fp16) isolates the attention implementation with everything else held constant. SDPA runs at 40.4× real-time; eager runs at 22.3×. VRAM drops from 3,589 MB to 2,291 MB. WER is identical at 8.93%.
The reason: PyTorch's SDPA dispatches to FlashAttention-2 kernels internally when available, fusing the attention computation into fewer GPU operations. Eager mode executes attention as separate matrix multiplications with intermediate activations stored in VRAM. The difference is not marginal — it is 1.8× throughput for one parameter change.
Most Whisper deployments use attn_implementation='eager' because it is the older default. Switching to 'sdpa' is the first change any Whisper deployment should make.
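For reference, a minimal loading sketch in Transformers; the attn_implementation argument is the one-line change. The model ID is from the table above, and everything else (dtype, device) is our choice of reasonable defaults.

```python
# Whisper-large-v3-turbo with SDPA attention, fp16, on a single GPU.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",   # the one-line change vs "eager"
)
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
    device="cuda:0",
)
```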
Finding 2: Batch size saturates at 4, not 16
Throughput peaks at batch=4 (40.4×) and declines at larger batch sizes — batch=8 drops to 36.7×, batch=16 to 37.1×, batch=32 to 36.1×. The L4's memory bandwidth is the constraint. At batch=4, the GPU is fully utilised. Beyond that, the overhead of managing larger batches through Whisper's chunked pipeline outweighs the parallelism benefit.
This is different from what you'd see on an L40S or A100, where the saturation point is higher. The optimal Whisper batch size is hardware-specific — do not copy configs from larger-GPU benchmarks directly.
Finding 3: Chunk-10 is an accuracy trap
Reducing chunk length from 30s to 10s causes WER to jump from 8.93% to 12.43% — a 3.5% absolute regression. Chunk=20 is much safer at 8.60%. The throughput gain from chunk=10 (30.5×) does not compensate for the accuracy loss, especially when batch=4 at chunk=30 already achieves 40.4× at 8.93% WER.
The cause: Whisper's encoder needs sufficient context to resolve ambiguous phonemes and cross-word boundaries. At 10s windows, many utterances are truncated mid-phrase, and the model's conditioning on previous chunks fails to fully recover. Do not reduce Whisper chunk size for throughput — use batch size instead.
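Chunk length and batch size are set at call time on the pipeline. Here is a compact sketch mirroring the strong configs above (the audio path is a placeholder): keep the chunk at 20 to 30 seconds and get throughput from batch size instead.

```python
# Compact pipeline form: SDPA via model_kwargs, chunk and batch set per call.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "sdpa"},
)

result = asr(
    "meeting.wav",        # placeholder path
    chunk_length_s=30,    # dropping this to 10 costs ~3.5 points of WER here
    batch_size=4,         # the throughput lever on an L4
)
print(result["text"])
```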
Finding 4: condition_on_prev_tokens has no measurable effect
w18 (condition_on_prev=True, bs=4) and w03 (condition_on_prev=False, bs=4) produce identical WER (8.93%) and near-identical throughput (38.9× vs 40.4×). On LibriSpeech — clean, structured read speech — cross-chunk context conditioning adds nothing. On highly conversational or fragmented audio, the result might differ. For clean speech workloads, this parameter does not matter.
Finding 5: INT8 is a VRAM play, not a throughput play
INT8 batch=1 (w08) achieves the lowest VRAM of any Whisper config: 1,207 MB. But throughput drops to 15.7× — less than half the fp16 baseline. At batch=8 with INT8 (w09), throughput recovers to 23.8× with VRAM at 1,867 MB.
INT8 is the right choice when you need to pack multiple inference workers onto a single GPU. It is not the right choice for maximising single-stream throughput.
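The post does not pin down which INT8 backend produced w08/w09; one common route in Transformers is bitsandbytes 8-bit quantisation at load time. A sketch under that assumption:

```python
# 8-bit Whisper via bitsandbytes (assumption: this or an equivalent backend;
# requires `pip install bitsandbytes accelerate`).
from transformers import AutoModelForSpeechSeq2Seq, BitsAndBytesConfig

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3-turbo",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",   # quantised weights are placed on the GPU automatically
)
# Weights-only footprint; the 1,207 MB peak in the table also includes activations.
print(f"{model.get_memory_footprint() / 1e6:.0f} MB")
```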
4. Parakeet Results and Findings
Full Results
| Config | WER | RTF | Throughput | p50 Lat | VRAM | Power | Energy |
|---|---|---|---|---|---|---|---|
| p01 — baseline (bs=1, fp16, greedy) | 15.77% | 0.0125 | 79.9× | 0.087s | 5,171 MB | 39.0 W | 0.487 Wh/hrA |
| p02 — batch-4, fp16 | 15.77% | 0.0050 | 200.2× | 0.033s | 5,147 MB | 45.9 W | 0.229 Wh/hrA |
| p03 — batch-8, fp16 | 15.72% | 0.0044 | 228.3× | 0.031s | 5,143 MB | 50.0 W | 0.219 Wh/hrA |
| p04 — batch-16, fp16 | 15.77% | 0.0046 | 218.0× | 0.034s | 5,143 MB | 51.6 W | 0.237 Wh/hrA |
| p05 — batch-32, fp16 | 15.77% | 0.0047 | 214.7× | 0.035s | 5,143 MB | 53.1 W | 0.247 Wh/hrA |
| p06 — batch-64 max | 15.77% | 0.0049 | 204.8× | 0.037s | 7,143 MB | 54.4 W | 0.266 Wh/hrA |
| p07 — bf16 baseline | 15.72% | 0.0120 | 83.0× | 0.088s | 5,153 MB | 43.8 W | 0.527 Wh/hrA |
| p08 — bf16 batch-4 | 15.77% | 0.0048 | 207.1× | 0.031s | 5,155 MB | 48.0 W | 0.232 Wh/hrA |
| p09 — bf16 batch-8 | 15.72% | 0.0042 | 238.9× | 0.029s | 5,151 MB | 50.9 W | 0.213 Wh/hrA |
| p10 — bf16 batch-16 | 15.77% | 0.0044 | 225.6× | 0.032s | 5,151 MB | 55.5 W | 0.246 Wh/hrA |
| p11 — bf16 batch-32 | 15.72% | 0.0045 | 219.8× | 0.034s | 5,151 MB | 54.0 W | 0.246 Wh/hrA |
| p12 — fp32 max accuracy | 15.72% | 0.0118 | 84.9× | 0.088s | 5,135 MB | 51.8 W | 0.610 Wh/hrA |
| p13 — fp32 batch-4 | 15.72% | 0.0052 | 191.7× | 0.036s | 5,139 MB | 58.3 W | 0.304 Wh/hrA |
| p14 — beam-4 (bs=1) | 15.83% ⚠️ | 0.0273 | 36.6× | 0.165s | 5,171 MB | 39.5 W | 1.079 Wh/hrA |
| p15 — batch-4 beam-4 | 15.94% ⚠️ | 0.0205 | 48.8× | 0.133s | 5,147 MB | 40.6 W | 0.831 Wh/hrA |
| p16 — batch-8 beam-4 | 15.88% ⚠️ | 0.0197 | 50.7× | 0.136s | 5,147 MB | 41.7 W | 0.822 Wh/hrA |
| p17 — batch-16 beam-4 | 15.99% ⚠️ | 0.0199 | 50.2× | 0.146s | 5,143 MB | 42.4 W | 0.844 Wh/hrA |
| p18 — bf16 beam-4 (bs=1) | 15.77% | 0.0200 | 50.1× | 0.131s | 5,155 MB | 42.2 W | 0.843 Wh/hrA |
| p19 — bf16 batch-8 beam-4 | 15.61% | 0.0194 | 51.4× | 0.134s | 5,155 MB | 41.4 W | 0.805 Wh/hrA |
| p20 — bf16 batch-32 | 15.72% | 0.0047 | 214.7× | 0.034s | 7,149 MB | 53.8 W | 0.251 Wh/hrA |
Finding 1: The throughput curve peaks at batch=8 and declines after
Going from batch=1 to batch=8 in fp16, throughput climbs from 79.9× to 228.3×. Then it declines: batch=16 drops to 218.0×, batch=32 to 214.7×, batch=64 to 204.8×. VRAM stays nearly constant from batch=1 through batch=32 (~5,143 MB), then jumps at batch=64 (7,143 MB) — without any throughput benefit.
The reason: NeMo's internal batching and dispatcher overhead increases with batch size. At batch=8, the GPU is fully saturated. Beyond that, the overhead of preparing, padding, and scheduling larger batches through the TDT decoder outweighs the additional parallelism. batch=8 wins on throughput, VRAM, and energy simultaneously.
Finding 2: Beam search on Parakeet gives up throughput for no reliable accuracy gain
This is the clearest negative result in the benchmark. Beam=4 at batch=1 (p14) runs at 36.6× — less than half the greedy throughput at the same batch size (79.9×). Beam=4 WER (15.83%) is higher than greedy WER (15.77%). Every fp16 beam search config underperforms the equivalent greedy config on both speed and accuracy; the lone bf16 exception (p19, 15.61% vs 15.72% for p09) buys 0.11 points of WER at roughly a fifth of the greedy throughput.
Why: TDT decoders predict both tokens and their durations jointly. The blank token mechanism already implicitly prunes low-probability paths during greedy decoding. Beam search explores alternative paths that the model's joint scoring function cannot meaningfully rank — producing no accuracy gain while multiplying decode time. Never use beam search with Parakeet.
Finding 3: FP32 provides no accuracy benefit
p12 (fp32, bs=1) and p01 (fp16, bs=1) both land at 15.72–15.77% WER. p12 draws 51.8W vs 39.0W for p01. FP32 adds 33% more power consumption with zero accuracy return on this model. Use fp16 or bf16 — fp32 is wasted compute for Parakeet.
Finding 4: bf16 batch=8 is the optimal config
p09 (bf16 batch=8) achieves 238.9× real-time at 50.9W and 0.213 Wh per hour of audio — the best combination of throughput, energy, and VRAM in the entire Parakeet grid.
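A minimal sketch of running that config with NeMo. The file paths and the autocast wrapper are ours; NeMo's default greedy decoding is left untouched, per Finding 2.

```python
# Parakeet at the winning config: bf16 autocast, batch=8, greedy decoding.
import torch
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
model = model.to("cuda").eval()

audio_files = ["clip_0001.wav", "clip_0002.wav"]  # placeholder paths

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    hypotheses = model.transcribe(audio_files, batch_size=8)

for hyp in hypotheses:
    # Newer NeMo versions return Hypothesis objects; older ones return plain strings.
    print(hyp.text if hasattr(hyp, "text") else hyp)
```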
5. Nemotron Results and Findings
Full Results
| Config | WER | RTF | Throughput | p50 Lat | VRAM | Power | Energy |
|---|---|---|---|---|---|---|---|
| n01 — bs=1, fp16, frame=0.1s | 10.30% | 0.0119 | 83.8× | 0.086s | 1,917 MB | 39.6 W | 0.472 Wh/hrA |
| n02 — bs=1, fp16, frame=0.5s | 10.30% | 0.0115 | 86.9× | 0.088s | 1,713 MB | 41.5 W | 0.477 Wh/hrA |
| n03 — bs=1, fp16, frame=1.0s | 10.30% | 0.0115 | 87.2× | 0.084s | 1,753 MB | 44.0 W | 0.505 Wh/hrA |
| n04 — bs=1, fp16, frame=2.0s | 10.30% | 0.0114 | 87.4× | 0.087s | 1,793 MB | 45.6 W | 0.521 Wh/hrA |
| n05 — bs=1, fp16, frame=4.0s | 10.30% | 0.0114 | 87.8× | 0.087s | 1,833 MB | 47.3 W | 0.539 Wh/hrA |
| n06 — batch-2, fp16, frame=1.0s | 10.30% | 0.0063 | 158.0× | 0.046s | 1,955 MB | 44.1 W | 0.279 Wh/hrA |
| n07 — batch-4, fp16, frame=1.0s | 10.30% | 0.0049 | 202.5× | 0.025s | 2,195 MB | 44.7 W | 0.221 Wh/hrA |
| n08 — batch-8, fp16, frame=1.0s | 10.30% | 0.0045 | 221.7× | 0.023s | 2,565 MB | 49.1 W | 0.221 Wh/hrA |
| n09 — batch-16, fp16, frame=1.0s | 10.30% | 0.0040 | 247.5× | 0.020s | 3,267 MB | 48.3 W | 0.195 Wh/hrA |
| n10 — batch-4, fp16, frame=0.5s | 10.30% | 0.0046 | 219.6× | 0.023s | 2,435 MB | 40.9 W | 0.186 Wh/hrA |
| n11 — batch-8, fp16, frame=0.5s | 10.30% | 0.0039 | 258.9× | 0.018s | 2,785 MB | 48.3 W | 0.187 Wh/hrA |
| n12 — batch-4, fp16, frame=2.0s | 10.30% | 0.0045 | 220.2× | 0.023s | 2,515 MB | 44.2 W | 0.201 Wh/hrA |
| n13 — batch-8, fp16, frame=2.0s | 10.30% | 0.0039 | 258.5× | 0.017s | 2,805 MB | 49.1 W | 0.190 Wh/hrA |
| n14 — bs=1, bf16, frame=1.0s | 10.46% | 0.0115 | 87.0× | 0.088s | 2,241 MB | 48.7 W | 0.560 Wh/hrA |
| n15 — batch-4, bf16, frame=1.0s | 10.41% | 0.0047 | 211.9× | 0.024s | 2,483 MB | 51.6 W | 0.244 Wh/hrA |
| n16 — batch-8, bf16, frame=1.0s | 10.46% | 0.0040 | 253.1× | 0.018s | 2,773 MB | 45.7 W | 0.181 Wh/hrA |
| n17 — batch-4, bf16, frame=0.5s | 10.41% | 0.0046 | 215.7× | 0.024s | 2,483 MB | 52.7 W | 0.244 Wh/hrA |
| n18 — batch-8, bf16, frame=2.0s | 10.46% | 0.0039 | 256.4× | 0.018s | 2,773 MB | 48.0 W | 0.187 Wh/hrA |
Finding 1: Frame length has zero effect on accuracy
WER is 10.30% across every single fp16 configuration — n01 through n13, from frame=0.1s to frame=4.0s, from batch=1 to batch=16. Not a single decimal point of change.
This is the cache-aware architecture working exactly as designed. Because the encoder processes each audio frame once and reuses its cached state, the model's acoustic representation does not change with chunk size. Frame length is a pure latency knob — you can tune it freely without touching accuracy.
For a live voice agent needing sub-20ms latency, use n11 (batch=8, frame=0.5s, p50=0.018s). For overnight batch transcription, use n09 (batch=16, frame=1.0s, 247.5× RT). Same weights, same accuracy, completely different operational profile.
Finding 2: Nemotron has the lowest VRAM footprint of any model here
At batch=1 frame=0.5s (n02), Nemotron uses 1,713 MB — lower than Whisper's 2,093 MB baseline, and dramatically lower than Parakeet's 5,143 MB. At batch=16 (n09), it uses only 3,267 MB.
On a 23 GB L4, a single card can serve 13 concurrent Nemotron instances at batch=1. For multi-tenant streaming ASR infrastructure, this is a significant operational advantage.
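That instance count is straight division of card memory by per-instance peak, with no allowance for fragmentation or transient activation spikes, so treat it as an upper bound:

```python
# Back-of-envelope concurrency estimate (upper bound).
l4_vram_mib = 23_034
nemotron_peak_mib = 1_713   # n02: batch=1, frame=0.5s
print(l4_vram_mib // nemotron_peak_mib)   # -> 13 concurrent instances
```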
Finding 3: bf16 introduces a small but consistent WER regression
fp16 configs land at 10.30% WER across the board. bf16 configs land at 10.41%–10.46%. The throughput difference at the top of the grid is marginal: the best fp16 config (n11, 258.9×) and the best bf16 config (n18, 256.4×) are within about 1% of each other. fp16 is the better default for Nemotron — the small WER regression in bf16 is not offset by any meaningful speed gain.
6. Power and Energy
| Model | Config | Avg Power | VRAM | Energy / hr audio |
|---|---|---|---|---|
| Whisper | baseline w01 | 68.6 W | 2,093 MB | 2.00 Wh/hrA |
| Whisper | bf16 batch-4 w10 | 64.9 W | 2,299 MB | 1.58 Wh/hrA |
| Whisper | INT8 bs=1 w08 | 50.9 W | 1,207 MB | 3.25 Wh/hrA |
| Parakeet | baseline p01 | 39.0 W | 5,171 MB | 0.487 Wh/hrA |
| Parakeet | bf16 batch-8 p09 | 50.9 W | 5,151 MB | 0.213 Wh/hrA |
| Nemotron | bs=1 frame=0.5s n02 | 41.5 W | 1,713 MB | 0.477 Wh/hrA |
| Nemotron | fp16 batch-8 frame=0.5s n11 | 48.3 W | 2,785 MB | 0.187 Wh/hrA |
| Nemotron | bf16 batch-8 n16 | 45.7 W | 2,773 MB | 0.181 Wh/hrA |
Parakeet and Nemotron at their optimal configs consume roughly 8–10× less energy per hour of audio than Whisper at baseline. The gap reflects architecture: Whisper's autoregressive decoder runs generation for every audio chunk. NeMo's transducer/streaming decoders are single-pass — the encoder runs once, the decoder is lightweight. More throughput per watt is not a coincidence; it is a structural property of the decoder design.
7. VRAM Headroom on 23 GB
| Model | Config | Peak VRAM | Free on L4 | Concurrent instances |
|---|---|---|---|---|
| Whisper INT8 bs=1 | w08 | 1,207 MB | ~21.8 GB | ~18 |
| Nemotron bs=1 frame=0.5s | n02 | 1,713 MB | ~21.3 GB | ~13 |
| Whisper baseline | w01 | 2,093 MB | ~20.9 GB | ~11 |
| Whisper bf16 batch-4 | w10 | 2,299 MB | ~20.7 GB | ~10 |
| Nemotron batch-8 | n11 | 2,785 MB | ~20.2 GB | ~8 |
| Parakeet bf16 batch-8 | p09 | 5,151 MB | ~17.9 GB | ~4 |
Nemotron's compact footprint makes the L4 viable for multi-tenant streaming deployments that would otherwise require a larger card.
8. What We Didn't Test
A few things fall outside the scope of this benchmark and are worth noting:
- Multi-GPU setups — all results are single L4. Tensor parallel configs for NeMo models on 2× or 4× GPU are not covered here.
- Real-world audio — LibriSpeech is clean, structured read speech. Results on spontaneous conversational audio, phone-quality recordings, or heavy background noise would differ, especially for NeMo models.
- Sustained load — we measured throughput on fixed batches. Latency under continuous concurrent stream load (10+ parallel streams) was not tested.
- Multilingual — Whisper supports 99 languages. Parakeet and Nemotron are English-only. Cross-language accuracy was not measured.
- TTFT (time to first token) — for streaming deployments, the latency to the first transcription output matters independently of throughput. Not measured here.
9. Summary and Decision Guide
Best Config Per Model
| Model | Best config | WER | Throughput | VRAM | Energy/hr audio |
|---|---|---|---|---|---|
| Whisper | bf16 batch-4 sdpa (w10) | 8.93% | 41.1× | 2,299 MB | 1.58 Wh/hrA |
| Whisper (best accuracy) | bf16 beam=4 (w19) | 8.11% | 26.9× | 2,121 MB | 2.64 Wh/hrA |
| Parakeet | bf16 batch-8 (p09) | 15.72%* | 238.9× | 5,151 MB | 0.213 Wh/hrA |
| Nemotron | fp16 batch-8 frame=0.5s (n11) | 10.30%* | 258.9× | 2,785 MB | 0.187 Wh/hrA |
*NeMo WER is inflated by the punctuation mismatch with LibriSpeech references noted in the test setup; compare within models, not across them.
Decision Guide by Use Case
| Use case | Model + Config |
|---|---|
| High-throughput batch transcription | Parakeet bf16 batch=8 |
| Max throughput + lowest energy | Nemotron fp16 batch=8 frame=0.5s |
| Noisy / accented / diverse audio | Whisper bf16 beam=4 sdpa |
| Live voice agent (low latency) | Nemotron fp16 batch=1 frame=0.1s |
| Multi-stream on single GPU | Nemotron fp16 batch=1 (13 concurrent on L4) |
| VRAM-constrained, need Whisper | Whisper INT8 batch=8 |
Decision Guide by Organisation Type
Startups and early-stage teams are usually optimising for one thing: cost per transcription hour while keeping quality acceptable. The L4 is the right card here — lower hourly rate than A100-class GPUs, Ada Lovelace architecture, 23 GB fits all three models comfortably. On a single L4, Nemotron at batch=8 delivers 258× real-time at 0.187 Wh per hour of audio. You can process an enormous volume of audio before the compute cost becomes significant. Parakeet is the alternative if accuracy on clean English is sufficient — 238× real-time at a similarly low 0.213 Wh per audio-hour. Both run on one card with no multi-GPU setup required.
Enterprises running production transcription pipelines — call centres, legal transcription, media captioning — typically have two concerns: accuracy on real-world audio and the ability to scale without re-architecting. Whisper is the right default for diverse, noisy, or accented audio given its broader training data. At bf16 batch=4 with SDPA, it runs at 41× real-time on a single L4 — meaning a small cluster of L4 instances handles significant concurrent load. For workloads where audio quality is controlled (internal meetings, studio recordings), swapping to Parakeet or Nemotron cuts energy cost by 7–8× with no accuracy regression on clean speech. The L4's 23 GB also allows running multiple smaller model instances per card — Nemotron's 1,713 MB footprint means up to 13 concurrent streams on one GPU, which maps well to enterprise multi-tenant deployments where you're serving many teams from shared infrastructure.
Teams building real-time voice products — voice agents, live captioning, real-time translation pipelines — need a model whose latency profile is tunable independently of accuracy. Nemotron is purpose-built for this. The cache-aware architecture means you set frame length based on your latency target, not based on accuracy constraints. At batch=1, frame=0.1s, p50 latency is 86ms per utterance. At frame=0.5s with batch=8, throughput jumps to 258× real-time with the same 10.30% WER. No other model in this benchmark offers that flexibility.
Conclusion
58 configurations. Three models. One NVIDIA L4.
The results that matter most: SDPA gives Whisper a free 1.8× throughput improvement that most deployments are leaving on the table. Beam search makes Parakeet strictly worse. Whisper chunk=10s silently destroys accuracy. Nemotron's WER is completely invariant to frame length — which means latency and throughput are independently tunable at zero accuracy cost.
The broader point is what the L4 makes possible. This is not an A100 or an H100. It is a 72W data-centre efficiency card with 23 GB of VRAM — and it ran all three models across 58 configs, with the best configs delivering 238–258× real-time throughput at under 55W average power draw. For startups watching compute spend closely, and for enterprises trying to run sustained transcription workloads without over-provisioning GPU capacity, the L4 hits a practical price-to-performance point that larger cards don't.
We ran everything — environment setup, model downloads, all 58 inference runs, metric collection — on a single E2E Cloud L4 instance. The same instance is available on-demand. If you want to reproduce these results or run your own model comparisons, the setup takes under 10 minutes on TIR.
Benchmark conducted on E2E Cloud TIR — NVIDIA L4, CUDA 12.4, PyTorch 2.6.0, NeMo 2.7.0, Transformers 4.43+


