Benchmarking Open ASR Models on NVIDIA L4: Parakeet vs Whisper vs Nemotron Speech

E2E Networks

Content Team @ E2E Networks

March 27, 2026 · 23 min read

Open-weight ASR has reached a point where the model choice is only half the decision. The other half is configuration — and most teams get it wrong by default.

Batch size, precision, attention implementation, chunk length, decoding strategy: each of these parameters affects throughput, accuracy, and memory usage in ways that are not obvious and not consistent across models. What works for Whisper does not apply to Parakeet. What is optimal on an A100 is not optimal on an L4.

We wanted concrete answers for the hardware we offer. So we took the three most widely used open ASR models, spun up an NVIDIA L4 (23 GB) instance on E2E Cloud, and ran 58 configurations — baseline through every meaningful software optimization — measuring WER, throughput, latency, VRAM, and power draw for each one.

This post is what we found.

Key Takeaways

  • Parakeet bf16 batch=8 hits 238× real-time throughput — one hour of audio in 15 seconds

  • Nemotron hits 258× real-time with WER pinned at 10.30% across every fp16 configuration

  • SDPA attention gives Whisper a 1.8× free throughput improvement — most deployments aren't using it

  • Beam search on Parakeet is strictly counterproductive: 2× slower with no accuracy gain

  • Whisper chunk=10s causes a 3.5% absolute WER regression — a hidden accuracy trap

  • All three models fit on 23 GB with significant headroom

Free Credits Inside

Get ₹2,000 free credits to test your AI workloads

Sign up and complete ID verification to unlock free credits. Deploy on NVIDIA H200, H100, and L40S GPUs—no commitment required.

The Problem with Default ASR Configurations

Every ASR library ships with defaults. Those defaults are reasonable starting points, but they are not production configurations.

The parameters that actually move the needle are not obvious:

  • Batch size — the single most impactful lever, but the optimal value is hardware-specific and model-specific. What works on an A100 does not work the same way on an L4.

  • Attention implementation — sdpa vs eager in Transformers. One line of code. 1.8× throughput difference on Whisper.

  • Precision — fp16 vs bf16 vs fp32. The accuracy impact varies by model and is not always what you expect.

  • Chunk length — for Whisper on long audio. Reducing it for lower latency causes a silent accuracy regression most benchmarks never report.

  • Decoding strategy — beam search vs greedy. On encoder-decoder models like Whisper, the trade-off is real. On transducer models like Parakeet, beam search is a liability.

  • Streaming frame length — for Nemotron. Turns out this is a pure latency knob with zero accuracy effect.

These parameters interact. A larger batch size with eager attention is slower than a smaller batch size with SDPA. You need to test combinations, not individual knobs in isolation.
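Enumerating a sweep like this is straightforward with itertools. The dimensions below are illustrative, not the exact 58-config grid we ran:

```python
# Sweep combinations, not individual knobs. A sketch of grid enumeration;
# the values below mirror part of the Whisper sweep, not the full benchmark.
from itertools import product

grid = [
    {"batch_size": b, "precision": p, "attn": a}
    for b, p, a in product([1, 4, 8, 16, 32], ["fp16", "bf16"], ["sdpa", "eager"])
]
print(len(grid))  # 20 combinations, before adding beam/chunk dimensions
```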

That is what this benchmark does.

1. The Three Models

| Model | Parameters | Architecture | Family |
|---|---|---|---|
| openai/whisper-large-v3-turbo | 809 M | Encoder-Decoder Transformer | OpenAI Whisper |
| nvidia/parakeet-tdt-0.6b-v3 | 600 M | FastConformer + TDT Decoder | NVIDIA NeMo |
| nvidia/nemotron-speech-streaming-en-0.6b | 600 M | Cache-Aware FastConformer + RNNT | NVIDIA NeMo |

Whisper-large-v3-turbo is OpenAI's distilled variant of large-v3 — faster, with a marginal accuracy trade-off. It is the most robust of the three on noisy and accented audio, reflecting its training on hundreds of thousands of hours of diverse web audio.

Parakeet-TDT-0.6b-v3 uses NVIDIA's Token-and-Duration Transducer decoder on top of a FastConformer encoder. It is designed specifically for high-throughput offline transcription and its numbers reflect that.

Nemotron-Speech-Streaming-en-0.6b is architecturally different from the other two. Its Cache-Aware FastConformer encodes each audio frame exactly once and reuses the cached state across steps. This means chunk size does not affect how much acoustic context the model sees — only when it sees it. The practical implication is significant and shows up directly in the benchmark results.


2. Test Setup

GPU: NVIDIA L4 — 23,034 MiB VRAM

CUDA: 12.4

PyTorch: 2.6.0

Stack: HuggingFace Transformers 4.43+ | NVIDIA NeMo 2.7.0

Dataset: LibriSpeech test-clean (50 samples) + test-other (50 samples)

Metrics Collected Per Config

| Metric | Definition |
|---|---|
| WER | Word Error Rate vs LibriSpeech ground truth |
| CER | Character Error Rate |
| RTF | inference_time / audio_duration — lower is better |
| Throughput | Real-time multiplier (1 / RTF) |
| p50 Latency | Median per-utterance inference time |
| Peak VRAM | Maximum GPU memory (nvidia-smi polling) |
| Avg Power | Mean watts during inference run |
| Energy / hr audio | Wh consumed per hour of audio processed |
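The derived metrics follow directly from the raw measurements. A minimal sketch of the arithmetic (input values are illustrative, not taken from a specific run):

```python
# How the table's derived metrics relate: RTF, real-time multiplier,
# and energy per hour of processed audio.

def asr_metrics(inference_time_s: float, audio_duration_s: float, avg_power_w: float):
    """Return (RTF, real-time multiplier, Wh per hour of audio)."""
    rtf = inference_time_s / audio_duration_s   # lower is better
    throughput = 1.0 / rtf                      # e.g. 240x real time
    # Watt-hours spent per audio-hour: power * compute-hours per audio-hour
    wh_per_audio_hour = avg_power_w * rtf
    return rtf, throughput, wh_per_audio_hour

rtf, speedup, wh = asr_metrics(inference_time_s=15.0, audio_duration_s=3600.0, avg_power_w=50.9)
print(f"RTF={rtf:.4f}  throughput={speedup:.0f}x  energy={wh:.3f} Wh/hrA")
# RTF=0.0042  throughput=240x  energy=0.212 Wh/hrA
```

Note how tightly throughput and energy are coupled: at fixed power draw, halving RTF halves the Wh per audio-hour.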

Configurations Tested

| Dimension | Whisper | Parakeet | Nemotron |
|---|---|---|---|
| Batch size | 1, 4, 8, 16, 32 | 1, 4, 8, 16, 32, 64 | 1, 2, 4, 8, 16 |
| Precision | fp16, bf16, int8 | fp16, bf16, fp32 | fp16, bf16 |
| Attention | sdpa, eager | | |
| Beam size | 1, 4, 8 | 1, 4 | |
| Chunk / frame length | 10s, 20s, 30s | | 0.1s, 0.5s, 1s, 2s, 4s |
| condition_on_prev | True, False | | |
| return_timestamps | True, False | | |
| Total configs | 20 | 20 | 18 |

Note on WER for NeMo models: Parakeet and Nemotron output punctuated text; LibriSpeech references are plain text. NeMo WER is inflated by approximately 2–3% absolute vs post-normalised scores. Comparisons within each model are fully valid; cross-model WER comparison should account for this.
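To make the inflation concrete, here is a self-contained sketch (not the scoring code from our runs) of a basic normaliser plus word-level edit distance. A single punctuated, capitalised word scores as a full substitution until the hypothesis is normalised:

```python
# Why punctuated NeMo output inflates WER against plain-text references.
import re

def normalise(text: str) -> str:
    """Lowercase and strip punctuation, approximating LibriSpeech-style references."""
    return re.sub(r"[^a-z' ]", "", text.lower()).strip()

def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over whitespace tokens."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / len(r)

ref = "he hoped there would be stew for dinner"
hyp = "He hoped there would be stew for dinner."
print(wer(ref, hyp))             # 0.25 — two "errors" that are pure formatting
print(wer(ref, normalise(hyp)))  # 0.0 after normalisation
```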

3. Whisper Results and Findings

Full Results

| Config | WER | RTF | Throughput | p50 Lat | VRAM | Power | Energy |
|---|---|---|---|---|---|---|---|
| w01 — baseline (bs=1, fp16, sdpa, chunk=30) | 8.93% | 0.0292 | 34.2× | 0.198s | 2,093 MB | 68.6 W | 2.00 Wh/hrA |
| w02 — accurate real-time (bs=1, beam=4) | 8.38% | 0.0370 | 27.1× | 0.244s | 2,113 MB | 69.1 W | 2.56 Wh/hrA |
| w03 — batch-4 greedy (fp16, sdpa) | 8.93% | 0.0248 | 40.4× | 0.176s | 2,291 MB | 70.3 W | 1.74 Wh/hrA |
| w04 — batch-8 greedy | 8.98% | 0.0272 | 36.7× | 0.194s | 2,719 MB | 68.1 W | 1.85 Wh/hrA |
| w05 — batch-16 greedy | 8.93% | 0.0270 | 37.1× | 0.193s | 3,195 MB | 66.1 W | 1.78 Wh/hrA |
| w06 — batch-32 greedy | 8.93% | 0.0277 | 36.1× | 0.201s | 3,945 MB | 65.4 W | 1.81 Wh/hrA |
| w07 — max accuracy (bs=1, beam=8) | 8.43% | 0.0471 | 21.2× | 0.311s | 2,303 MB | 69.7 W | 3.28 Wh/hrA |
| w08 — INT8 cost-optimised (bs=1) | 8.43% | 0.0638 | 15.7× | 0.423s | 1,207 MB | 50.9 W | 3.25 Wh/hrA |
| w09 — INT8 batch-8 | 8.76% | 0.0420 | 23.8× | 0.304s | 1,867 MB | 61.6 W | 2.59 Wh/hrA |
| w10 — bf16 batch-4 | 8.93% | 0.0243 | 41.1× | 0.173s | 2,299 MB | 64.9 W | 1.58 Wh/hrA |
| w11 — bf16 batch-8 | 8.98% | 0.0266 | 37.6× | 0.191s | 2,727 MB | 65.2 W | 1.74 Wh/hrA |
| w12 — eager attn (bs=4, fp16) | 8.93% | 0.0449 | 22.3× | 0.324s | 3,589 MB | 69.3 W | 3.11 Wh/hrA |
| w13 — chunk-10 (bs=4) | 12.43% ⚠️ | 0.0328 | 30.5× | 0.234s | 2,373 MB | 67.0 W | 2.20 Wh/hrA |
| w14 — chunk-20 (bs=4) | 8.60% | 0.0266 | 37.7× | 0.186s | 2,297 MB | 68.4 W | 1.82 Wh/hrA |
| w15 — batch-8 beam-4 balanced | 8.38% | 0.0388 | 25.8× | 0.276s | 3,873 MB | 69.5 W | 2.70 Wh/hrA |
| w16 — timestamps (bs=4) | 8.21% | 0.0268 | 37.3× | 0.193s | 2,299 MB | 71.5 W | 1.92 Wh/hrA |
| w17 — max throughput bf16 (bs=32) | 8.87% | 0.0269 | 37.2× | 0.198s | 3,953 MB | 66.2 W | 1.78 Wh/hrA |
| w18 — condition_on_prev (bs=4) | 8.93% | 0.0257 | 38.9× | 0.186s | 2,299 MB | 72.3 W | 1.86 Wh/hrA |
| w19 — bf16 accurate (bs=1, beam=4) | 8.11% | 0.0371 | 26.9× | 0.248s | 2,121 MB | 71.1 W | 2.64 Wh/hrA |
| w20 — batch-16 beam-4 | 8.38% | 0.0402 | 24.9× | 0.292s | 5,411 MB | 67.5 W | 2.72 Wh/hrA |

Finding 1: SDPA vs Eager — 1.8× throughput, zero accuracy cost

Comparing w03 (SDPA, batch=4, fp16) and w12 (eager, batch=4, fp16) isolates the attention implementation with everything else held constant. SDPA runs at 40.4× real-time; eager runs at 22.3×. VRAM drops from 3,589 MB to 2,291 MB. WER is identical at 8.93%.

The reason: PyTorch's SDPA dispatches to FlashAttention-2 kernels internally when available, fusing the attention computation into fewer GPU operations. Eager mode executes attention as separate matrix multiplications with intermediate activations stored in VRAM. The difference is not marginal — it is 1.8× throughput for one parameter change.

Most Whisper deployments use attn_implementation='eager' because it is the older default. Switching to 'sdpa' is the first change any Whisper deployment should make.
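For reference, here is what that change looks like in Transformers. This is a minimal sketch, assuming a CUDA device; `meeting.wav` is a placeholder path:

```python
# Whisper with SDPA attention via the HuggingFace ASR pipeline.
# One model_kwargs entry is the difference between w03 (40.4x) and w12 (22.3x).
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.bfloat16,                     # bf16 batch-4 was the best config (w10)
    device="cuda:0",
    model_kwargs={"attn_implementation": "sdpa"},   # vs "eager": ~1.8x throughput, less VRAM
)

# chunk_length_s=30 preserves full encoder context (see Finding 3 below).
result = asr("meeting.wav", chunk_length_s=30, batch_size=4)
print(result["text"])
```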

Finding 2: Batch size saturates at 4, not 16

Throughput peaks at batch=4 (40.4×) and settles slightly lower at larger batch sizes — 36.7× at batch=8, 37.1× at batch=16, 36.1× at batch=32. The L4's memory bandwidth is the constraint. At batch=4, the GPU is fully utilised. Beyond that, the overhead of managing larger batches through Whisper's chunked pipeline cancels out the extra parallelism.

This is different from what you'd see on an L40S or A100, where the saturation point is higher. The optimal Whisper batch size is hardware-specific — do not copy configs from larger-GPU benchmarks directly.

Finding 3: Chunk-10 is an accuracy trap

Reducing chunk length from 30s to 10s causes WER to jump from 8.93% to 12.43% — a 3.5% absolute regression. Chunk=20 is much safer at 8.60%. The throughput gain from chunk=10 (30.5×) does not compensate for the accuracy loss, especially when batch=4 at chunk=30 already achieves 40.4× at 8.93% WER.

The cause: Whisper's encoder needs sufficient context to resolve ambiguous phonemes and cross-word boundaries. At 10s windows, many utterances are truncated mid-phrase, and the model's conditioning on previous chunks fails to fully recover. Do not reduce Whisper chunk size for throughput — use batch size instead.

Finding 4: condition_on_prev_tokens has no measurable effect

w18 (condition_on_prev=True, bs=4) and w03 (condition_on_prev=False, bs=4) produce identical WER (8.93%) and near-identical throughput (38.9× vs 40.4×). On LibriSpeech — clean, structured read speech — cross-chunk context conditioning adds nothing. On highly conversational or fragmented audio, the result might differ. For clean speech workloads, this parameter does not matter.

Finding 5: INT8 is a VRAM play, not a throughput play

INT8 batch=1 (w08) achieves the lowest VRAM of any Whisper config: 1,207 MB. But throughput drops to 15.7× — less than half the fp16 baseline. At batch=8 with INT8 (w09), throughput recovers to 23.8× with VRAM at 1,867 MB.

INT8 is the right choice when you need to pack multiple inference workers onto a single GPU. It is not the right choice for maximising single-stream throughput.
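As a sketch of that packing setup, assuming the bitsandbytes integration in Transformers (the benchmark does not specify which quantisation backend was used, so treat this as one reasonable way to get INT8 weights):

```python
# INT8 Whisper for density, not speed: quantisation happens at load time,
# trading single-stream throughput for a ~1.2 GB VRAM footprint per worker.
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, BitsAndBytesConfig

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3-turbo",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="cuda:0",
)
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3-turbo")
```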

4. Parakeet Results and Findings

Full Results

| Config | WER | RTF | Throughput | p50 Lat | VRAM | Power | Energy |
|---|---|---|---|---|---|---|---|
| p01 — baseline (bs=1, fp16, greedy) | 15.77% | 0.0125 | 79.9× | 0.087s | 5,171 MB | 39.0 W | 0.487 Wh/hrA |
| p02 — batch-4, fp16 | 15.77% | 0.0050 | 200.2× | 0.033s | 5,147 MB | 45.9 W | 0.229 Wh/hrA |
| p03 — batch-8, fp16 | 15.72% | 0.0044 | 228.3× | 0.031s | 5,143 MB | 50.0 W | 0.219 Wh/hrA |
| p04 — batch-16, fp16 | 15.77% | 0.0046 | 218.0× | 0.034s | 5,143 MB | 51.6 W | 0.237 Wh/hrA |
| p05 — batch-32, fp16 | 15.77% | 0.0047 | 214.7× | 0.035s | 5,143 MB | 53.1 W | 0.247 Wh/hrA |
| p06 — batch-64 max | 15.77% | 0.0049 | 204.8× | 0.037s | 7,143 MB | 54.4 W | 0.266 Wh/hrA |
| p07 — bf16 baseline | 15.72% | 0.0120 | 83.0× | 0.088s | 5,153 MB | 43.8 W | 0.527 Wh/hrA |
| p08 — bf16 batch-4 | 15.77% | 0.0048 | 207.1× | 0.031s | 5,155 MB | 48.0 W | 0.232 Wh/hrA |
| p09 — bf16 batch-8 | 15.72% | 0.0042 | 238.9× | 0.029s | 5,151 MB | 50.9 W | 0.213 Wh/hrA |
| p10 — bf16 batch-16 | 15.77% | 0.0044 | 225.6× | 0.032s | 5,151 MB | 55.5 W | 0.246 Wh/hrA |
| p11 — bf16 batch-32 | 15.72% | 0.0045 | 219.8× | 0.034s | 5,151 MB | 54.0 W | 0.246 Wh/hrA |
| p12 — fp32 max accuracy | 15.72% | 0.0118 | 84.9× | 0.088s | 5,135 MB | 51.8 W | 0.610 Wh/hrA |
| p13 — fp32 batch-4 | 15.72% | 0.0052 | 191.7× | 0.036s | 5,139 MB | 58.3 W | 0.304 Wh/hrA |
| p14 — beam-4 (bs=1) | 15.83% ⚠️ | 0.0273 | 36.6× | 0.165s | 5,171 MB | 39.5 W | 1.079 Wh/hrA |
| p15 — batch-4 beam-4 | 15.94% ⚠️ | 0.0205 | 48.8× | 0.133s | 5,147 MB | 40.6 W | 0.831 Wh/hrA |
| p16 — batch-8 beam-4 | 15.88% ⚠️ | 0.0197 | 50.7× | 0.136s | 5,147 MB | 41.7 W | 0.822 Wh/hrA |
| p17 — batch-16 beam-4 | 15.99% ⚠️ | 0.0199 | 50.2× | 0.146s | 5,143 MB | 42.4 W | 0.844 Wh/hrA |
| p18 — bf16 beam-4 (bs=1) | 15.77% | 0.0200 | 50.1× | 0.131s | 5,155 MB | 42.2 W | 0.843 Wh/hrA |
| p19 — bf16 batch-8 beam-4 | 15.61% | 0.0194 | 51.4× | 0.134s | 5,155 MB | 41.4 W | 0.805 Wh/hrA |
| p20 — bf16 batch-32 | 15.72% | 0.0047 | 214.7× | 0.034s | 7,149 MB | 53.8 W | 0.251 Wh/hrA |

Finding 1: The throughput curve peaks at batch=8 and declines after

Going from batch=1 to batch=8 in fp16, throughput climbs from 79.9× to 228.3×. Then it declines: batch=16 drops to 218.0×, batch=32 to 214.7×, batch=64 to 204.8×. VRAM stays nearly constant from batch=1 through batch=32 (~5,143 MB), then jumps at batch=64 (7,143 MB) — without any throughput benefit.

The reason: NeMo's internal batching and dispatcher overhead increases with batch size. At batch=8, the GPU is fully saturated. Beyond that, the overhead of preparing, padding, and scheduling larger batches through the TDT decoder outweighs the additional parallelism. batch=8 wins on throughput, VRAM, and energy simultaneously.

Finding 2: Beam search on Parakeet is strictly worse on all axes

This is the clearest negative result in the benchmark. Beam=4 at batch=1 (p14) runs at 36.6× — less than half the greedy throughput at the same batch size (79.9×). Beam=4 WER (15.83%) is higher than greedy WER (15.77%). Every beam search config underperforms the equivalent greedy config on both speed and accuracy.

Why: TDT decoders predict both tokens and their durations jointly. The blank token mechanism already implicitly prunes low-probability paths during greedy decoding. Beam search explores alternative paths that the model's joint scoring function cannot meaningfully rank — producing no accuracy gain while multiplying decode time. Never use beam search with Parakeet.
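The recommended setup (bf16, batch=8, greedy) is short to express. A sketch assuming NeMo's `ASRModel.from_pretrained` and `transcribe` APIs; the audio paths are placeholders:

```python
# Parakeet at its benchmark-optimal settings: bf16, batch=8, greedy decoding.
import torch
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
model = model.to(torch.bfloat16).eval().cuda()

# Greedy is the default TDT decoding strategy. Leave it alone: in our grid,
# beam=4 halved throughput and raised WER in every Parakeet configuration.
with torch.inference_mode():
    hyps = model.transcribe(["call_01.wav", "call_02.wav"], batch_size=8)
print(hyps[0])
```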

Finding 3: FP32 provides no accuracy benefit

p12 (fp32, bs=1) and p01 (fp16, bs=1) both land at 15.72–15.77% WER. p12 draws 51.8W vs 39.0W for p01. FP32 adds 33% more power consumption with zero accuracy return on this model. Use fp16 or bf16 — fp32 is wasted compute for Parakeet.

Finding 4: bf16 batch=8 is the optimal config

p09 (bf16 batch=8) achieves 238.9× real-time at 50.9W and 0.213 Wh per hour of audio — the best combination of throughput, energy, and VRAM in the entire Parakeet grid.

5. Nemotron Results and Findings

Full Results

| Config | WER | RTF | Throughput | p50 Lat | VRAM | Power | Energy |
|---|---|---|---|---|---|---|---|
| n01 — bs=1, fp16, frame=0.1s | 10.30% | 0.0119 | 83.8× | 0.086s | 1,917 MB | 39.6 W | 0.472 Wh/hrA |
| n02 — bs=1, fp16, frame=0.5s | 10.30% | 0.0115 | 86.9× | 0.088s | 1,713 MB | 41.5 W | 0.477 Wh/hrA |
| n03 — bs=1, fp16, frame=1.0s | 10.30% | 0.0115 | 87.2× | 0.084s | 1,753 MB | 44.0 W | 0.505 Wh/hrA |
| n04 — bs=1, fp16, frame=2.0s | 10.30% | 0.0114 | 87.4× | 0.087s | 1,793 MB | 45.6 W | 0.521 Wh/hrA |
| n05 — bs=1, fp16, frame=4.0s | 10.30% | 0.0114 | 87.8× | 0.087s | 1,833 MB | 47.3 W | 0.539 Wh/hrA |
| n06 — batch-2, fp16, frame=1.0s | 10.30% | 0.0063 | 158.0× | 0.046s | 1,955 MB | 44.1 W | 0.279 Wh/hrA |
| n07 — batch-4, fp16, frame=1.0s | 10.30% | 0.0049 | 202.5× | 0.025s | 2,195 MB | 44.7 W | 0.221 Wh/hrA |
| n08 — batch-8, fp16, frame=1.0s | 10.30% | 0.0045 | 221.7× | 0.023s | 2,565 MB | 49.1 W | 0.221 Wh/hrA |
| n09 — batch-16, fp16, frame=1.0s | 10.30% | 0.0040 | 247.5× | 0.020s | 3,267 MB | 48.3 W | 0.195 Wh/hrA |
| n10 — batch-4, fp16, frame=0.5s | 10.30% | 0.0046 | 219.6× | 0.023s | 2,435 MB | 40.9 W | 0.186 Wh/hrA |
| n11 — batch-8, fp16, frame=0.5s | 10.30% | 0.0039 | 258.9× | 0.018s | 2,785 MB | 48.3 W | 0.187 Wh/hrA |
| n12 — batch-4, fp16, frame=2.0s | 10.30% | 0.0045 | 220.2× | 0.023s | 2,515 MB | 44.2 W | 0.201 Wh/hrA |
| n13 — batch-8, fp16, frame=2.0s | 10.30% | 0.0039 | 258.5× | 0.017s | 2,805 MB | 49.1 W | 0.190 Wh/hrA |
| n14 — bs=1, bf16, frame=1.0s | 10.46% | 0.0115 | 87.0× | 0.088s | 2,241 MB | 48.7 W | 0.560 Wh/hrA |
| n15 — batch-4, bf16, frame=1.0s | 10.41% | 0.0047 | 211.9× | 0.024s | 2,483 MB | 51.6 W | 0.244 Wh/hrA |
| n16 — batch-8, bf16, frame=1.0s | 10.46% | 0.0040 | 253.1× | 0.018s | 2,773 MB | 45.7 W | 0.181 Wh/hrA |
| n17 — batch-4, bf16, frame=0.5s | 10.41% | 0.0046 | 215.7× | 0.024s | 2,483 MB | 52.7 W | 0.244 Wh/hrA |
| n18 — batch-8, bf16, frame=2.0s | 10.46% | 0.0039 | 256.4× | 0.018s | 2,773 MB | 48.0 W | 0.187 Wh/hrA |

Finding 1: Frame length has zero effect on accuracy

WER is 10.30% across every single fp16 configuration — n01 through n13, from frame=0.1s to frame=4.0s, from batch=1 to batch=16. Not a single decimal point of change.

This is the cache-aware architecture working exactly as designed. Because the encoder processes each audio frame once and reuses its cached state, the model's acoustic representation does not change with chunk size. Frame length is a pure latency knob — you can tune it freely without touching accuracy.

For a live voice agent needing sub-20ms latency, use n11 (batch=8, frame=0.5s, p50=0.018s). For overnight batch transcription, use n09 (batch=16, frame=1.0s, 247.5× RT). Same weights, same accuracy, completely different operational profile.

Finding 2: Nemotron has the lowest VRAM footprint of any model here

At batch=1 frame=0.5s (n02), Nemotron uses 1,713 MB — lower than Whisper's 2,093 MB baseline, and dramatically lower than Parakeet's 5,143 MB. At batch=16 (n09), it uses only 3,267 MB.

On a 23 GB L4, a single card can serve 13 concurrent Nemotron instances at batch=1. For multi-tenant streaming ASR infrastructure, this is a significant operational advantage.

Finding 3: bf16 introduces a small but consistent WER regression

fp16 configs land at 10.30% WER across the board. bf16 configs land at 10.41%–10.46%. The throughput difference between fp16 and bf16 at the same batch size is marginal (~2%). fp16 is the better default for Nemotron — the WER regression in bf16 is not offset by any meaningful speed gain.

6. Power and Energy

| Model | Config | Avg Power | VRAM | Energy / hr audio |
|---|---|---|---|---|
| Whisper | baseline (w01) | 68.6 W | 2,093 MB | 2.00 Wh/hrA |
| Whisper | bf16 batch-4 (w10) | 64.9 W | 2,299 MB | 1.58 Wh/hrA |
| Whisper | INT8 bs=1 (w08) | 50.9 W | 1,207 MB | 3.25 Wh/hrA |
| Parakeet | baseline (p01) | 39.0 W | 5,171 MB | 0.487 Wh/hrA |
| Parakeet | bf16 batch-8 (p09) | 50.9 W | 5,151 MB | 0.213 Wh/hrA |
| Nemotron | bs=1 frame=0.5s (n02) | 41.5 W | 1,713 MB | 0.477 Wh/hrA |
| Nemotron | fp16 batch-8 frame=0.5s (n11) | 48.3 W | 2,785 MB | 0.187 Wh/hrA |
| Nemotron | bf16 batch-8 (n16) | 45.7 W | 2,773 MB | 0.181 Wh/hrA |

Parakeet and Nemotron at their optimal configs consume roughly 8–10× less energy per hour of audio than Whisper at baseline. The gap reflects architecture: Whisper's autoregressive decoder runs generation for every audio chunk. NeMo's transducer/streaming decoders are single-pass — the encoder runs once, the decoder is lightweight. More throughput per watt is not a coincidence; it is a structural property of the decoder design.

7. VRAM Headroom on 23 GB

| Model | Config | Peak VRAM | Free on L4 | Concurrent instances |
|---|---|---|---|---|
| Whisper INT8 bs=1 | w08 | 1,207 MB | ~21.8 GB | ~18 |
| Nemotron bs=1 frame=0.5s | n02 | 1,713 MB | ~21.3 GB | ~13 |
| Whisper baseline | w01 | 2,093 MB | ~20.9 GB | ~11 |
| Whisper bf16 batch-4 | w10 | 2,299 MB | ~20.7 GB | ~10 |
| Nemotron batch-8 | n11 | 2,785 MB | ~20.2 GB | ~8 |
| Parakeet bf16 batch-8 | p09 | 5,151 MB | ~17.9 GB | ~4 |
Nemotron's compact footprint makes the L4 viable for multi-tenant streaming deployments that would otherwise require a larger card.
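The instance counts above are back-of-envelope division. A sketch of the arithmetic, assuming each worker's peak matches the single-instance numbers and reserving headroom for CUDA context and fragmentation (the table's counts used their own headroom assumption, so treat the output as an upper bound, not a guarantee):

```python
# How many inference workers fit on one 23 GB L4, naively.
L4_VRAM_MB = 23_034  # as reported by nvidia-smi on our instance

def max_instances(peak_vram_mb: int, reserve_mb: int = 1_000) -> int:
    """Workers that fit after reserving headroom for CUDA context/fragmentation."""
    return (L4_VRAM_MB - reserve_mb) // peak_vram_mb

for name, vram in [("Whisper INT8 bs=1", 1207),
                   ("Nemotron bs=1 frame=0.5s", 1713),
                   ("Parakeet bf16 batch-8", 5151)]:
    print(name, max_instances(vram))
```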

8. What We Didn't Test

A few things outside the scope of this benchmark that are worth noting:

  • Multi-GPU setups — all results are single L4. Tensor parallel configs for NeMo models on 2× or 4× GPU are not covered here.

  • Real-world audio — LibriSpeech is clean, structured read speech. Results on spontaneous conversational audio, phone-quality recordings, or heavy background noise would differ, especially for NeMo models.

  • Sustained load — we measured throughput on fixed batches. Latency under continuous concurrent stream load (10+ parallel streams) was not tested.

  • Multilingual — Whisper supports 99 languages. Parakeet and Nemotron are English-only. Cross-language accuracy was not measured.

  • TTFT (time to first token) — for streaming deployments, the latency to the first transcription output matters independently of throughput. Not measured here.

9. Summary and Decision Guide

Best Config Per Model

| Model | Best config | WER | Throughput | VRAM | Energy / hr audio |
|---|---|---|---|---|---|
| Whisper | bf16 batch-4 sdpa (w10) | 8.93% | 41.1× | 2,299 MB | 1.58 Wh/hrA |
| Whisper (best accuracy) | bf16 beam=4 (w19) | 8.11% | 26.9× | 2,121 MB | 2.64 Wh/hrA |
| Parakeet | bf16 batch-8 (p09) | 15.72%* | 238.9× | 5,151 MB | 0.213 Wh/hrA |
| Nemotron | fp16 batch-8 frame=0.5s (n11) | 10.30%* | 258.9× | 2,785 MB | 0.187 Wh/hrA |

\* Unnormalised NeMo output; see the note in Section 2 — punctuation inflates these WER figures by roughly 2–3% absolute vs post-normalised scores.

Decision Guide by Use Case

| Use case | Model + Config |
|---|---|
| High-throughput batch transcription | Parakeet bf16 batch=8 |
| Max throughput + lowest energy | Nemotron fp16 batch=8 frame=0.5s |
| Noisy / accented / diverse audio | Whisper bf16 beam=4 sdpa |
| Live voice agent (low latency) | Nemotron fp16 batch=1 frame=0.1s |
| Multi-stream on single GPU | Nemotron fp16 batch=1 (13 concurrent on L4) |
| VRAM-constrained, need Whisper | Whisper INT8 batch=8 |

Decision Guide by Organisation Type

Startups and early-stage teams are usually optimising for one thing: cost per transcription hour while keeping quality acceptable. The L4 is the right card here — lower hourly rate than A100-class GPUs, Ada Lovelace architecture, 23 GB fits all three models comfortably. On a single L4, Nemotron at batch=8 delivers 258× real-time at 0.187 Wh per hour of audio. You can process an enormous volume of audio before the compute cost becomes significant. Parakeet is the alternative if accuracy on clean English is sufficient — 238× real-time at even lower energy per audio-hour. Both run on one card with no multi-GPU setup required.

Enterprises running production transcription pipelines — call centres, legal transcription, media captioning — typically have two concerns: accuracy on real-world audio and the ability to scale without re-architecting. Whisper is the right default for diverse, noisy, or accented audio given its broader training data. At bf16 batch=4 with SDPA, it runs at 41× real-time on a single L4 — meaning a small cluster of L4 instances handles significant concurrent load. For workloads where audio quality is controlled (internal meetings, studio recordings), swapping to Parakeet or Nemotron cuts energy cost by 7–8× with no accuracy regression on clean speech. The L4's 23 GB also allows running multiple smaller model instances per card — Nemotron's 1,713 MB footprint means up to 13 concurrent streams on one GPU, which maps well to enterprise multi-tenant deployments where you're serving many teams from shared infrastructure.

Teams building real-time voice products — voice agents, live captioning, real-time translation pipelines — need a model whose latency profile is tunable independently of accuracy. Nemotron is purpose-built for this. The cache-aware architecture means you set frame length based on your latency target, not based on accuracy constraints. At batch=1 with frame=0.1s, p50 latency is 86ms per utterance. At batch=8 with frame=0.5s, throughput jumps to 258.9× real-time with the same 10.30% WER. No other model in this benchmark offers that flexibility.

Conclusion

58 configurations. Three models. One NVIDIA L4.

The results that matter most: SDPA gives Whisper a free 1.8× throughput improvement that most deployments are leaving on the table. Beam search makes Parakeet strictly worse. Whisper chunk=10s silently destroys accuracy. Nemotron's WER is completely invariant to frame length — which means latency and throughput are independently tunable at zero accuracy cost.

The broader point is what the L4 makes possible. This is not an A100 or an H100. It is a 72W data-centre efficiency card with 23 GB of VRAM — and it ran all three models across 58 configs, with the best configs delivering 238–258× real-time throughput at under 55W average power draw. For startups watching compute spend closely, and for enterprises trying to run sustained transcription workloads without over-provisioning GPU capacity, the L4 hits a practical price-to-performance point that larger cards don't.

We ran everything — environment setup, model downloads, all 58 inference runs, metric collection — on a single E2E Cloud L4 instance. The same instance is available on-demand. If you want to reproduce these results or run your own model comparisons, the setup takes under 10 minutes on TIR.

Benchmark conducted on E2E Cloud TIR — NVIDIA L4, CUDA 12.4, PyTorch 2.6.0, NeMo 2.7.0, Transformers 4.43+
