Open-weight ASR has reached a point where the model choice is only half the decision. The other half is configuration — and most teams get it wrong by default.
Batch size, precision, attention implementation, chunk length, decoding strategy: each of these parameters affects throughput, accuracy, and memory usage in ways that are not obvious and not consistent across models. What works for Whisper does not apply to Parakeet. What is optimal on an A100 is not optimal on an L4.
We wanted concrete answers for the hardware we offer. So we took the three most widely used open ASR models, spun up an NVIDIA L4 (23 GB) instance on E2E Cloud, and ran 58 configurations — baseline through every meaningful software optimization — measuring WER, throughput, latency, VRAM, and power draw for each one.
This post is what we found.
Key Takeaways
- Parakeet bf16 batch=8 hits 238× real-time throughput — one hour of audio in 15 seconds
- Nemotron hits 258× real-time with WER that never moves across any fp16 configuration
- SDPA attention gives Whisper a 1.8× free throughput improvement — most deployments aren't using it
- Beam search on Parakeet is counterproductive: 2× slower with no reliable accuracy gain
- Whisper chunk=10s causes a 3.5% absolute WER regression — a hidden accuracy trap
- All three models fit on 23 GB with significant headroom
Get ₹2,000 free credits to test your AI workloads
Sign up and complete ID verification to unlock free credits. Deploy on NVIDIA H200, H100, and L40S GPUs—no commitment required.
The Problem with Default ASR Configurations
Every ASR library ships with defaults. Those defaults are reasonable starting points, but they are not production configurations.
The parameters that actually move the needle are not obvious:
- Batch size — the single most impactful lever, but the optimal value is hardware-specific and model-specific. What works on an A100 does not work the same way on an L4.
- Attention implementation — sdpa vs eager in Transformers. One line of code. 1.8× throughput difference on Whisper.
- Precision — fp16 vs bf16 vs fp32. The accuracy impact varies by model and is not always what you expect.
- Chunk length — for Whisper on long audio. Reducing it for lower latency causes a silent accuracy regression most benchmarks never report.
- Decoding strategy — beam search vs greedy. On encoder-decoder models like Whisper, the trade-off is real. On transducer models like Parakeet, beam search is a liability.
- Streaming frame length — for Nemotron. Turns out this is a pure latency knob with zero accuracy effect.
These parameters interact. A larger batch size with eager attention is slower than a smaller batch size with SDPA. You need to test combinations, not individual knobs in isolation.
That is what this benchmark does.
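To make that concrete, here is a minimal sketch of the kind of sweep we mean (the names are illustrative, not our actual harness): configurations are enumerated as combinations and measured together, not tuned one knob at a time.

```python
# Illustrative sweep skeleton. run_config() is a hypothetical function that
# would load the model with these settings and measure WER, RTF, VRAM, and
# power on a fixed evaluation set.
from itertools import product

batch_sizes = [1, 4, 8, 16, 32]
precisions = ["fp16", "bf16"]
attn_impls = ["sdpa", "eager"]

results = []
for bs, precision, attn in product(batch_sizes, precisions, attn_impls):
    config = {"batch_size": bs, "precision": precision, "attention": attn}
    print(config)
    # results.append(run_config(config))  # hypothetical benchmark call
```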
1. The Three Models
| Model | Parameters | Architecture | Family |
|---|---|---|---|
| openai/whisper-large-v3-turbo | 809 M | Encoder-Decoder Transformer | OpenAI Whisper |
| nvidia/parakeet-tdt-0.6b-v3 | 600 M | FastConformer + TDT Decoder | NVIDIA NeMo |
| nvidia/nemotron-speech-streaming-en-0.6b | 600 M | Cache-Aware FastConformer + RNNT | NVIDIA NeMo |
Whisper-large-v3-turbo is OpenAI's distilled variant of large-v3 — faster, with a marginal accuracy trade-off. It is the most robust of the three on noisy and accented audio, reflecting its training on hundreds of thousands of hours of diverse web audio.
Parakeet-TDT-0.6b-v3 uses NVIDIA's Token-and-Duration Transducer decoder on top of a FastConformer encoder. It is designed specifically for high-throughput offline transcription and its numbers reflect that.
Nemotron-Speech-Streaming-en-0.6b is architecturally different from the other two. Its Cache-Aware FastConformer encodes each audio frame exactly once and reuses the cached state across steps. This means chunk size does not affect how much acoustic context the model sees — only when it sees it. The practical implication is significant and shows up directly in the benchmark results.
2. Test Setup
GPU: NVIDIA L4 — 23,034 MiB VRAM
CUDA: 12.4
PyTorch: 2.6.0
Stack: HuggingFace Transformers 4.43+ | NVIDIA NeMo 2.7.0
Dataset: LibriSpeech test-clean (50 samples) + test-other (50 samples)
Metrics Collected Per Config
| Metric | Definition |
|---|---|
| WER | Word Error Rate vs LibriSpeech ground truth |
| CER | Character Error Rate |
| RTF | inference_time / audio_duration — lower is better |
| Throughput | Real-time multiplier (1 / RTF) — higher is better |
| p50 Latency | Median per-utterance inference time |
| Peak VRAM | Maximum GPU memory (nvidia-smi polling) |
| Avg Power | Mean watts during inference run |
| Energy / hr audio | Wh consumed per hour of audio processed |
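One relationship worth keeping in mind when reading the tables: the energy column follows from the other two measurements. Watt-hours per hour of audio is average power draw multiplied by RTF; this is our reading of the metric, and it is consistent with the reported numbers.

```python
# Sanity check: Wh per hour of audio = average power (W) x RTF.
# Values are taken from the result tables below (the Whisper baseline and the
# best Parakeet config).
rtf, avg_power_w = 0.0292, 68.6
print(round(avg_power_w * rtf, 2))   # ~2.00 Wh/hrA, matching the Whisper baseline row

rtf, avg_power_w = 0.0042, 50.9
print(round(avg_power_w * rtf, 2))   # ~0.21 Wh/hrA, matching Parakeet bf16 batch=8
```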
Configurations Tested
| Dimension | Whisper | Parakeet | Nemotron |
|---|---|---|---|
| Batch size | 1, 4, 8, 16, 32 | 1, 4, 8, 16, 32, 64 | 1, 2, 4, 8, 16 |
| Precision | fp16, bf16, int8 | fp16, bf16, fp32 | fp16, bf16 |
| Attention | sdpa, eager | — | — |
| Beam size | 1, 4, 8 | 1, 4 | — |
| Chunk / frame length | 10s, 20s, 30s | — | 0.1s, 0.5s, 1s, 2s, 4s |
| condition_on_prev | True, False | — | — |
| return_timestamps | True, False | — | — |
| Total configs | 20 | 20 | 18 |
Note on WER for NeMo models: Parakeet and Nemotron output punctuated text; LibriSpeech references are plain text. NeMo WER is inflated by approximately 2–3% absolute vs post-normalised scores. Comparisons within each model are fully valid; cross-model WER comparison should account for this.
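For reference, a minimal sketch of the normalisation step that closes most of that gap: lower-case and strip punctuation on both sides before scoring with a WER library such as jiwer. This is a generic illustration, not the exact scoring script used for the tables below.

```python
# Minimal WER normalisation sketch (assumption: lower-casing and stripping
# punctuation is enough for LibriSpeech-style plain-text references).
import re
import jiwer  # pip install jiwer

def normalise(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)      # drop punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()

reference = "he hoped there would be stew for dinner"
hypothesis = "He hoped there would be stew for dinner."  # punctuated NeMo-style output

print(jiwer.wer(reference, hypothesis))                        # 0.25: penalised for case and punctuation
print(jiwer.wer(normalise(reference), normalise(hypothesis)))  # 0.0 after normalisation
```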
3. Whisper Results and Findings
Full Results
| Config | WER | RTF | Throughput | p50 Lat | VRAM | Power | Energy |
|---|---|---|---|---|---|---|---|
| w01 — baseline (bs=1, fp16, sdpa, chunk=30) | 8.93% | 0.0292 | 34.2× | 0.198s | 2,093 MB | 68.6 W | 2.00 Wh/hrA |
| w02 — accurate real-time (bs=1, beam=4) | 8.38% | 0.0370 | 27.1× | 0.244s | 2,113 MB | 69.1 W | 2.56 Wh/hrA |
| w03 — batch-4 greedy (fp16, sdpa) | 8.93% | 0.0248 | 40.4× | 0.176s | 2,291 MB | 70.3 W | 1.74 Wh/hrA |
| w04 — batch-8 greedy | 8.98% | 0.0272 | 36.7× | 0.194s | 2,719 MB | 68.1 W | 1.85 Wh/hrA |
| w05 — batch-16 greedy | 8.93% | 0.0270 | 37.1× | 0.193s | 3,195 MB | 66.1 W | 1.78 Wh/hrA |
| w06 — batch-32 greedy | 8.93% | 0.0277 | 36.1× | 0.201s | 3,945 MB | 65.4 W | 1.81 Wh/hrA |
| w07 — max accuracy (bs=1, beam=8) | 8.43% | 0.0471 | 21.2× | 0.311s | 2,303 MB | 69.7 W | 3.28 Wh/hrA |
| w08 — INT8 cost-optimised (bs=1) | 8.43% | 0.0638 | 15.7× | 0.423s | 1,207 MB | 50.9 W | 3.25 Wh/hrA |
| w09 — INT8 batch-8 | 8.76% | 0.0420 | 23.8× | 0.304s | 1,867 MB | 61.6 W | 2.59 Wh/hrA |
| w10 — bf16 batch-4 | 8.93% | 0.0243 | 41.1× | 0.173s | 2,299 MB | 64.9 W | 1.58 Wh/hrA |
| w11 — bf16 batch-8 | 8.98% | 0.0266 | 37.6× | 0.191s | 2,727 MB | 65.2 W | 1.74 Wh/hrA |
| w12 — eager attn (bs=4, fp16) | 8.93% | 0.0449 | 22.3× | 0.324s | 3,589 MB | 69.3 W | 3.11 Wh/hrA |
| w13 — chunk-10 (bs=4) | 12.43% ⚠️ | 0.0328 | 30.5× | 0.234s | 2,373 MB | 67.0 W | 2.20 Wh/hrA |
| w14 — chunk-20 (bs=4) | 8.60% | 0.0266 | 37.7× | 0.186s | 2,297 MB | 68.4 W | 1.82 Wh/hrA |
| w15 — batch-8 beam-4 balanced | 8.38% | 0.0388 | 25.8× | 0.276s | 3,873 MB | 69.5 W | 2.70 Wh/hrA |
| w16 — timestamps (bs=4) | 8.21% | 0.0268 | 37.3× | 0.193s | 2,299 MB | 71.5 W | 1.92 Wh/hrA |
| w17 — max throughput bf16 (bs=32) | 8.87% | 0.0269 | 37.2× | 0.198s | 3,953 MB | 66.2 W | 1.78 Wh/hrA |
| w18 — condition_on_prev (bs=4) | 8.93% | 0.0257 | 38.9× | 0.186s | 2,299 MB | 72.3 W | 1.86 Wh/hrA |
| w19 — bf16 accurate (bs=1, beam=4) | 8.11% | 0.0371 | 26.9× | 0.248s | 2,121 MB | 71.1 W | 2.64 Wh/hrA |
| w20 — batch-16 beam-4 | 8.38% | 0.0402 | 24.9× | 0.292s | 5,411 MB | 67.5 W | 2.72 Wh/hrA |
Finding 1: SDPA vs Eager — 1.8× throughput, zero accuracy cost
Comparing w03 (SDPA, batch=4, fp16) and w12 (eager, batch=4, fp16) isolates the attention implementation with everything else held constant. SDPA runs at 40.4× real-time; eager runs at 22.3×. VRAM drops from 3,589 MB to 2,291 MB. WER is identical at 8.93%.
The reason: PyTorch's SDPA dispatches to FlashAttention-2 kernels internally when available, fusing the attention computation into fewer GPU operations. Eager mode executes attention as separate matrix multiplications with intermediate activations stored in VRAM. The difference is not marginal — it is 1.8× throughput for one parameter change.
Most Whisper deployments use attn_implementation='eager' because it is the older default. Switching to 'sdpa' is the first change any Whisper deployment should make.
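For reference, a minimal loading sketch in Transformers; the attn_implementation argument is the one-line change. The model ID is from the table above, and everything else (dtype, device) is our choice of reasonable defaults.

```python
# Whisper-large-v3-turbo with SDPA attention, fp16, on a single GPU.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",   # the one-line change vs "eager"
)
processor = AutoProcessor.from_pretrained(model_id)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
    device="cuda:0",
)
```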
Finding 2: Batch size saturates at 4, not 16
Throughput peaks at batch=4 (40.4×) and declines at larger batch sizes — batch=8 drops to 36.7×, batch=16 to 37.1×, batch=32 to 36.1×. The L4's memory bandwidth is the constraint. At batch=4, the GPU is fully utilised. Beyond that, the overhead of managing larger batches through Whisper's chunked pipeline outweighs the parallelism benefit.
This is different from what you'd see on an L40S or A100, where the saturation point is higher. The optimal Whisper batch size is hardware-specific — do not copy configs from larger-GPU benchmarks directly.
Finding 3: Chunk-10 is an accuracy trap
Reducing chunk length from 30s to 10s causes WER to jump from 8.93% to 12.43% — a 3.5% absolute regression. Chunk=20 is much safer at 8.60%. The throughput gain from chunk=10 (30.5×) does not compensate for the accuracy loss, especially when batch=4 at chunk=30 already achieves 40.4× at 8.93% WER.
The cause: Whisper's encoder needs sufficient context to resolve ambiguous phonemes and cross-word boundaries. At 10s windows, many utterances are truncated mid-phrase, and the model's conditioning on previous chunks fails to fully recover. Do not reduce Whisper chunk size for throughput — use batch size instead.
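Chunk length and batch size are set at call time on the pipeline. Here is a compact sketch mirroring the strong configs above (the audio path is a placeholder): keep the chunk at 20 to 30 seconds and get throughput from batch size instead.

```python
# Compact pipeline form: SDPA via model_kwargs, chunk and batch set per call.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "sdpa"},
)

result = asr(
    "meeting.wav",        # placeholder path
    chunk_length_s=30,    # dropping this to 10 costs ~3.5 points of WER here
    batch_size=4,         # the throughput lever on an L4
)
print(result["text"])
```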
Finding 4: condition_on_prev_tokens has no measurable effect
w18 (condition_on_prev=True, bs=4) and w03 (condition_on_prev=False, bs=4) produce identical WER (8.93%) and near-identical throughput (38.9× vs 40.4×). On LibriSpeech — clean, structured read speech — cross-chunk context conditioning adds nothing. On highly conversational or fragmented audio, the result might differ. For clean speech workloads, this parameter does not matter.
Finding 5: INT8 is a VRAM play, not a throughput play
INT8 batch=1 (w08) achieves the lowest VRAM of any Whisper config: 1,207 MB. But throughput drops to 15.7× — less than half the fp16 baseline. At batch=8 with INT8 (w09), throughput recovers to 23.8× with VRAM at 1,867 MB.
INT8 is the right choice when you need to pack multiple inference workers onto a single GPU. It is not the right choice for maximising single-stream throughput.
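The post does not pin down which INT8 backend produced w08/w09; one common route in Transformers is bitsandbytes 8-bit quantisation at load time. A sketch under that assumption:

```python
# 8-bit Whisper via bitsandbytes (assumption: this or an equivalent backend;
# requires `pip install bitsandbytes accelerate`).
from transformers import AutoModelForSpeechSeq2Seq, BitsAndBytesConfig

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3-turbo",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",   # quantised weights are placed on the GPU automatically
)
# Weights-only footprint; the 1,207 MB peak in the table also includes activations.
print(f"{model.get_memory_footprint() / 1e6:.0f} MB")
```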
4. Parakeet Results and Findings
Full Results
| Config | WER | RTF | Throughput | p50 Lat | VRAM | Power | Energy |
|---|---|---|---|---|---|---|---|
| p01 — baseline (bs=1, fp16, greedy) | 15.77% | 0.0125 | 79.9× | 0.087s | 5,171 MB | 39.0 W | 0.487 Wh/hrA |
| p02 — batch-4, fp16 | 15.77% | 0.0050 | 200.2× | 0.033s | 5,147 MB | 45.9 W | 0.229 Wh/hrA |
| p03 — batch-8, fp16 | 15.72% | 0.0044 | 228.3× | 0.031s | 5,143 MB | 50.0 W | 0.219 Wh/hrA |
| p04 — batch-16, fp16 | 15.77% | 0.0046 | 218.0× | 0.034s | 5,143 MB | 51.6 W | 0.237 Wh/hrA |
| p05 — batch-32, fp16 | 15.77% | 0.0047 | 214.7× | 0.035s | 5,143 MB | 53.1 W | 0.247 Wh/hrA |
| p06 — batch-64 max | 15.77% | 0.0049 | 204.8× | 0.037s | 7,143 MB | 54.4 W | 0.266 Wh/hrA |
| p07 — bf16 baseline | 15.72% | 0.0120 | 83.0× | 0.088s | 5,153 MB | 43.8 W | 0.527 Wh/hrA |
| p08 — bf16 batch-4 | 15.77% | 0.0048 | 207.1× | 0.031s | 5,155 MB | 48.0 W | 0.232 Wh/hrA |
| p09 — bf16 batch-8 | 15.72% | 0.0042 | 238.9× | 0.029s | 5,151 MB | 50.9 W | 0.213 Wh/hrA |
| p10 — bf16 batch-16 | 15.77% | 0.0044 | 225.6× | 0.032s | 5,151 MB | 55.5 W | 0.246 Wh/hrA |
| p11 — bf16 batch-32 | 15.72% | 0.0045 | 219.8× | 0.034s | 5,151 MB | 54.0 W | 0.246 Wh/hrA |
| p12 — fp32 max accuracy | 15.72% | 0.0118 | 84.9× | 0.088s | 5,135 MB | 51.8 W | 0.610 Wh/hrA |
| p13 — fp32 batch-4 | 15.72% | 0.0052 | 191.7× | 0.036s | 5,139 MB | 58.3 W | 0.304 Wh/hrA |
| p14 — beam-4 (bs=1) | 15.83% ⚠️ | 0.0273 | 36.6× | 0.165s | 5,171 MB | 39.5 W | 1.079 Wh/hrA |
| p15 — batch-4 beam-4 | 15.94% ⚠️ | 0.0205 | 48.8× | 0.133s | 5,147 MB | 40.6 W | 0.831 Wh/hrA |
| p16 — batch-8 beam-4 | 15.88% ⚠️ | 0.0197 | 50.7× | 0.136s | 5,147 MB | 41.7 W | 0.822 Wh/hrA |
| p17 — batch-16 beam-4 | 15.99% ⚠️ | 0.0199 | 50.2× | 0.146s | 5,143 MB | 42.4 W | 0.844 Wh/hrA |
| p18 — bf16 beam-4 (bs=1) | 15.77% | 0.0200 | 50.1× | 0.131s | 5,155 MB | 42.2 W | 0.843 Wh/hrA |
| p19 — bf16 batch-8 beam-4 | 15.61% | 0.0194 | 51.4× | 0.134s | 5,155 MB | 41.4 W | 0.805 Wh/hrA |
| p20 — bf16 batch-32 | 15.72% | 0.0047 | 214.7× | 0.034s | 7,149 MB | 53.8 W | 0.251 Wh/hrA |
Finding 1: The throughput curve peaks at batch=8 and declines after
Going from batch=1 to batch=8 in fp16, throughput climbs from 79.9× to 228.3×. Then it declines: batch=16 drops to 218.0×, batch=32 to 214.7×, batch=64 to 204.8×. VRAM stays nearly constant from batch=1 through batch=32 (~5,143 MB), then jumps at batch=64 (7,143 MB) — without any throughput benefit.
The reason: NeMo's internal batching and dispatcher overhead increases with batch size. At batch=8, the GPU is fully saturated. Beyond that, the overhead of preparing, padding, and scheduling larger batches through the TDT decoder outweighs the additional parallelism. batch=8 wins on throughput, VRAM, and energy simultaneously.
Finding 2: Beam search on Parakeet gives up throughput for no reliable accuracy gain
This is the clearest negative result in the benchmark. Beam=4 at batch=1 (p14) runs at 36.6× — less than half the greedy throughput at the same batch size (79.9×). Beam=4 WER (15.83%) is higher than greedy WER (15.77%). Every fp16 beam search config underperforms the equivalent greedy config on both speed and accuracy; the lone bf16 exception (p19, 15.61% vs 15.72% for p09) buys 0.11 points of WER at roughly a fifth of the greedy throughput.
Why: TDT decoders predict both tokens and their durations jointly. The blank token mechanism already implicitly prunes low-probability paths during greedy decoding. Beam search explores alternative paths that the model's joint scoring function cannot meaningfully rank — producing no accuracy gain while multiplying decode time. Never use beam search with Parakeet.
Finding 3: FP32 provides no accuracy benefit
p12 (fp32, bs=1) and p01 (fp16, bs=1) both land at 15.72–15.77% WER. p12 draws 51.8W vs 39.0W for p01. FP32 adds 33% more power consumption with zero accuracy return on this model. Use fp16 or bf16 — fp32 is wasted compute for Parakeet.
Finding 4: bf16 batch=8 is the optimal config
p09 (bf16 batch=8) achieves 238.9× real-time at 50.9W and 0.213 Wh per hour of audio — the best combination of throughput, energy, and VRAM in the entire Parakeet grid.
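A minimal sketch of running that config with NeMo. The file paths and the autocast wrapper are ours; NeMo's default greedy decoding is left untouched, per Finding 2.

```python
# Parakeet at the winning config: bf16 autocast, batch=8, greedy decoding.
import torch
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
model = model.to("cuda").eval()

audio_files = ["clip_0001.wav", "clip_0002.wav"]  # placeholder paths

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    hypotheses = model.transcribe(audio_files, batch_size=8)

for hyp in hypotheses:
    # Newer NeMo versions return Hypothesis objects; older ones return plain strings.
    print(hyp.text if hasattr(hyp, "text") else hyp)
```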
5. Nemotron Results and Findings
Full Results
| Config | WER | RTF | Throughput | p50 Lat | VRAM | Power | Energy |
|---|---|---|---|---|---|---|---|
| n01 — bs=1, fp16, frame=0.1s | 10.30% | 0.0119 | 83.8× | 0.086s | 1,917 MB | 39.6 W | 0.472 Wh/hrA |
| n02 — bs=1, fp16, frame=0.5s | 10.30% | 0.0115 | 86.9× | 0.088s | 1,713 MB | 41.5 W | 0.477 Wh/hrA |
| n03 — bs=1, fp16, frame=1.0s | 10.30% | 0.0115 | 87.2× | 0.084s | 1,753 MB | 44.0 W | 0.505 Wh/hrA |
| n04 — bs=1, fp16, frame=2.0s | 10.30% | 0.0114 | 87.4× | 0.087s | 1,793 MB | 45.6 W | 0.521 Wh/hrA |
| n05 — bs=1, fp16, frame=4.0s | 10.30% | 0.0114 | 87.8× | 0.087s | 1,833 MB | 47.3 W | 0.539 Wh/hrA |
| n06 — batch-2, fp16, frame=1.0s | 10.30% | 0.0063 | 158.0× | 0.046s | 1,955 MB | 44.1 W | 0.279 Wh/hrA |
| n07 — batch-4, fp16, frame=1.0s | 10.30% | 0.0049 | 202.5× | 0.025s | 2,195 MB | 44.7 W | 0.221 Wh/hrA |
| n08 — batch-8, fp16, frame=1.0s | 10.30% | 0.0045 | 221.7× | 0.023s | 2,565 MB | 49.1 W | 0.221 Wh/hrA |
| n09 — batch-16, fp16, frame=1.0s | 10.30% | 0.0040 | 247.5× | 0.020s | 3,267 MB | 48.3 W | 0.195 Wh/hrA |
| n10 — batch-4, fp16, frame=0.5s | 10.30% | 0.0046 | 219.6× | 0.023s | 2,435 MB | 40.9 W | 0.186 Wh/hrA |
| n11 — batch-8, fp16, frame=0.5s | 10.30% | 0.0039 | 258.9× | 0.018s | 2,785 MB | 48.3 W | 0.187 Wh/hrA |
| n12 — batch-4, fp16, frame=2.0s | 10.30% | 0.0045 | 220.2× | 0.023s | 2,515 MB | 44.2 W | 0.201 Wh/hrA |
| n13 — batch-8, fp16, frame=2.0s | 10.30% | 0.0039 | 258.5× | 0.017s | 2,805 MB | 49.1 W | 0.190 Wh/hrA |
| n14 — bs=1, bf16, frame=1.0s | 10.46% | 0.0115 | 87.0× | 0.088s | 2,241 MB | 48.7 W | 0.560 Wh/hrA |
| n15 — batch-4, bf16, frame=1.0s | 10.41% | 0.0047 | 211.9× | 0.024s | 2,483 MB | 51.6 W | 0.244 Wh/hrA |
| n16 — batch-8, bf16, frame=1.0s | 10.46% | 0.0040 | 253.1× | 0.018s | 2,773 MB | 45.7 W | 0.181 Wh/hrA |
| n17 — batch-4, bf16, frame=0.5s | 10.41% | 0.0046 | 215.7× | 0.024s | 2,483 MB | 52.7 W | 0.244 Wh/hrA |
| n18 — batch-8, bf16, frame=2.0s | 10.46% | 0.0039 | 256.4× | 0.018s | 2,773 MB | 48.0 W | 0.187 Wh/hrA |
Finding 1: Frame length has zero effect on accuracy
WER is 10.30% across every single fp16 configuration — n01 through n13, from frame=0.1s to frame=4.0s, from batch=1 to batch=16. Not a single decimal point of change.
This is the cache-aware architecture working exactly as designed. Because the encoder processes each audio frame once and reuses its cached state, the model's acoustic representation does not change with chunk size. Frame length is a pure latency knob — you can tune it freely without touching accuracy.
For a live voice agent needing sub-20ms latency, use n11 (batch=8, frame=0.5s, p50=0.018s). For overnight batch transcription, use n09 (batch=16, frame=1.0s, 247.5× RT). Same weights, same accuracy, completely different operational profile.
Finding 2: Nemotron has the lowest VRAM footprint of any model here
At batch=1 frame=0.5s (n02), Nemotron uses 1,713 MB — lower than Whisper's 2,093 MB baseline, and dramatically lower than Parakeet's 5,143 MB. At batch=16 (n09), it uses only 3,267 MB.
On a 23 GB L4, a single card can serve 13 concurrent Nemotron instances at batch=1. For multi-tenant streaming ASR infrastructure, this is a significant operational advantage.
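That instance count is straight division of card memory by per-instance peak, with no allowance for fragmentation or transient activation spikes, so treat it as an upper bound:

```python
# Back-of-envelope concurrency estimate (upper bound).
l4_vram_mib = 23_034
nemotron_peak_mib = 1_713   # n02: batch=1, frame=0.5s
print(l4_vram_mib // nemotron_peak_mib)   # -> 13 concurrent instances
```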
Finding 3: bf16 introduces a small but consistent WER regression
fp16 configs land at 10.30% WER across the board. bf16 configs land at 10.41%–10.46%. The throughput difference at the top of the grid is marginal: the best fp16 config (n11, 258.9×) and the best bf16 config (n18, 256.4×) are within about 1% of each other. fp16 is the better default for Nemotron — the small WER regression in bf16 is not offset by any meaningful speed gain.
6. Power and Energy
| Model | Config | Avg Power | VRAM | Energy / hr audio |
|---|---|---|---|---|
| Whisper | baseline w01 | 68.6 W | 2,093 MB | 2.00 Wh/hrA |
| Whisper | bf16 batch-4 w10 | 64.9 W | 2,299 MB | 1.58 Wh/hrA |
| Whisper | INT8 bs=1 w08 | 50.9 W | 1,207 MB | 3.25 Wh/hrA |
| Parakeet | baseline p01 | 39.0 W | 5,171 MB | 0.487 Wh/hrA |
| Parakeet | bf16 batch-8 p09 | 50.9 W | 5,151 MB | 0.213 Wh/hrA |
| Nemotron | bs=1 frame=0.5s n02 | 41.5 W | 1,713 MB | 0.477 Wh/hrA |
| Nemotron | fp16 batch-8 frame=0.5s n11 | 48.3 W | 2,785 MB | 0.187 Wh/hrA |
| Nemotron | bf16 batch-8 n16 | 45.7 W | 2,773 MB | 0.181 Wh/hrA |
Parakeet and Nemotron at their optimal configs consume roughly 8–10× less energy per hour of audio than Whisper at baseline. The gap reflects architecture: Whisper's autoregressive decoder runs generation for every audio chunk. NeMo's transducer/streaming decoders are single-pass — the encoder runs once, the decoder is lightweight. More throughput per watt is not a coincidence; it is a structural property of the decoder design.
7. VRAM Headroom on 23 GB
| Model | Config | Peak VRAM | Free on L4 | Concurrent instances |
|---|---|---|---|---|
| Whisper INT8 bs=1 | w08 | 1,207 MB | ~21.8 GB | ~18 |
| Nemotron bs=1 frame=0.5s | n02 | 1,713 MB | ~21.3 GB | ~13 |
| Whisper baseline | w01 | 2,093 MB | ~20.9 GB | ~11 |
| Whisper bf16 batch-4 | w10 | 2,299 MB | ~20.7 GB | ~10 |
| Nemotron batch-8 | n11 | 2,785 MB | ~20.2 GB | ~8 |
| Parakeet bf16 batch-8 | p09 | 5,151 MB | ~17.9 GB | ~4 |
Nemotron's compact footprint makes the L4 viable for multi-tenant streaming deployments that would otherwise require a larger card.
8. What We Didn't Test
A few things fall outside the scope of this benchmark and are worth noting:
- Multi-GPU setups — all results are single L4. Tensor parallel configs for NeMo models on 2× or 4× GPU are not covered here.
- Real-world audio — LibriSpeech is clean, structured read speech. Results on spontaneous conversational audio, phone-quality recordings, or heavy background noise would differ, especially for NeMo models.
- Sustained load — we measured throughput on fixed batches. Latency under continuous concurrent stream load (10+ parallel streams) was not tested.
- Multilingual — Whisper supports 99 languages. Parakeet and Nemotron are English-only. Cross-language accuracy was not measured.
- TTFT (time to first token) — for streaming deployments, the latency to the first transcription output matters independently of throughput. Not measured here.
9. Summary and Decision Guide
Best Config Per Model
| Model | Best config | WER | Throughput | VRAM | Energy/hr audio |
|---|---|---|---|---|---|
| Whisper | bf16 batch-4 sdpa (w10) | 8.93% | 41.1× | 2,299 MB | 1.58 Wh/hrA |
| Whisper (best accuracy) | bf16 beam=4 (w19) | 8.11% | 26.9× | 2,121 MB | 2.64 Wh/hrA |
| Parakeet | bf16 batch-8 (p09) | 15.72%* | 238.9× | 5,151 MB | 0.213 Wh/hrA |
| Nemotron | fp16 batch-8 frame=0.5s (n11) | 10.30%* | 258.9× | 2,785 MB | 0.187 Wh/hrA |
*NeMo WER is inflated by the punctuation mismatch with LibriSpeech references noted in the test setup; compare within models, not across them.
Decision Guide by Use Case
| Use case | Model + Config |
|---|---|
| High-throughput batch transcription | Parakeet bf16 batch=8 |
| Max throughput + lowest energy | Nemotron fp16 batch=8 frame=0.5s |
| Noisy / accented / diverse audio | Whisper bf16 beam=4 sdpa |
| Live voice agent (low latency) | Nemotron fp16 batch=1 frame=0.1s |
| Multi-stream on single GPU | Nemotron fp16 batch=1 (13 concurrent on L4) |
| VRAM-constrained, need Whisper | Whisper INT8 batch=8 |
Decision Guide by Organisation Type
Startups and early-stage teams are usually optimising for one thing: cost per transcription hour while keeping quality acceptable. The L4 is the right card here — lower hourly rate than A100-class GPUs, Ada Lovelace architecture, 23 GB fits all three models comfortably. On a single L4, Nemotron at batch=8 delivers 258× real-time at 0.187 Wh per hour of audio. You can process an enormous volume of audio before the compute cost becomes significant. Parakeet is the alternative if accuracy on clean English is sufficient — 238× real-time at a similarly low 0.213 Wh per audio-hour. Both run on one card with no multi-GPU setup required.
Enterprises running production transcription pipelines — call centres, legal transcription, media captioning — typically have two concerns: accuracy on real-world audio and the ability to scale without re-architecting. Whisper is the right default for diverse, noisy, or accented audio given its broader training data. At bf16 batch=4 with SDPA, it runs at 41× real-time on a single L4 — meaning a small cluster of L4 instances handles significant concurrent load. For workloads where audio quality is controlled (internal meetings, studio recordings), swapping to Parakeet or Nemotron cuts energy cost by 7–8× with no accuracy regression on clean speech. The L4's 23 GB also allows running multiple smaller model instances per card — Nemotron's 1,713 MB footprint means up to 13 concurrent streams on one GPU, which maps well to enterprise multi-tenant deployments where you're serving many teams from shared infrastructure.
Teams building real-time voice products — voice agents, live captioning, real-time translation pipelines — need a model whose latency profile is tunable independently of accuracy. Nemotron is purpose-built for this. The cache-aware architecture means you set frame length based on your latency target, not based on accuracy constraints. At batch=1, frame=0.1s, p50 latency is 86ms per utterance. At frame=0.5s with batch=8, throughput jumps to 258× real-time with the same 10.30% WER. No other model in this benchmark offers that flexibility.
Conclusion
58 configurations. Three models. One NVIDIA L4.
The results that matter most: SDPA gives Whisper a free 1.8× throughput improvement that most deployments are leaving on the table. Beam search makes Parakeet strictly worse. Whisper chunk=10s silently destroys accuracy. Nemotron's WER is completely invariant to frame length — which means latency and throughput are independently tunable at zero accuracy cost.
The broader point is what the L4 makes possible. This is not an A100 or an H100. It is a 72W data-centre efficiency card with 23 GB of VRAM — and it ran all three models across 58 configs, with the best configs delivering 238–258× real-time throughput at under 55W average power draw. For startups watching compute spend closely, and for enterprises trying to run sustained transcription workloads without over-provisioning GPU capacity, the L4 hits a practical price-to-performance point that larger cards don't.
We ran everything — environment setup, model downloads, all 58 inference runs, metric collection — on a single E2E Cloud L4 instance. The same instance is available on-demand. If you want to reproduce these results or run your own model comparisons, the setup takes under 10 minutes on TIR.
Benchmark conducted on E2E Cloud TIR — NVIDIA L4, CUDA 12.4, PyTorch 2.6.0, NeMo 2.7.0, Transformers 4.43+


