Scaling AI in production: What Nobody Tells You

Pankhuri

Content Team @ E2E Networks

April 21, 2026·6 min read
Free Credits Inside

Get ₹2,000 free credits to test your AI workloads

Sign up and complete ID verification to unlock free credits. Deploy on NVIDIA H200, H100, and L40S GPUs—no commitment required.


Key takeaways from a talk at the MLDS event on GPU inference optimization covering model tuning, concurrency, request routing, and real-world case studies from the E2E TIR platform team.

Everyone has access to the same frontier models now. So where does the actual competitive edge come from? According to the team behind E2E Cloud's TIR platform (an end-to-end AI development platform), the answer is increasingly about how efficiently you can serve that intelligence at scale.

At a recent MLDS event, the team shared a practical framework built from working with dozens of AI product teams across India. Here's what they learned.


The three metrics that actually matter

When you strip away the hype, production AI performance comes down to three numbers:

  1. Throughput
  2. Latency
  3. GPU utilization

The key insight: your revenue will always be limited by how well you can utilize your GPUs. Throughput is not a technical detail; it's your top line.
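To make "throughput is your top line" concrete, here's a back-of-the-envelope sketch. The numbers are illustrative, not E2E pricing: plug in your own per-token price and measured throughput.

```python
# Back-of-the-envelope: sustained throughput caps daily revenue.
# Illustrative numbers only -- substitute your own price and throughput.

def daily_revenue_ceiling(tokens_per_sec: float, price_per_m_tokens: float) -> float:
    """Max revenue per day if every generated token is billable."""
    tokens_per_day = tokens_per_sec * 86_400  # seconds in a day
    return tokens_per_day / 1_000_000 * price_per_m_tokens

# A worker sustaining 2,000 tok/s at $0.50 per million output tokens:
print(round(daily_revenue_ceiling(2_000, 0.50), 2))  # 86.4
```

Every percentage point of GPU utilization you recover moves this ceiling directly, which is why the rest of this article is about squeezing more tokens out of the same hardware.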



The optimization stack (work through it sequentially)

You don't need to optimize everything on day one. Think of it as a layered approach where you work from the top down as your scale grows.



Layer 1: Model tuning (start here)

Quantization: the easiest win

Quantization means storing model weights in a shorter data format. It sounds like a compromise, but for models above 30B parameters, the quality loss is often negligible.
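The memory savings are easy to quantify. A minimal sketch of weight-only footprint at common precisions (weights only; KV cache and activations come on top):

```python
# Weight-memory footprint by precision. Bytes per parameter:
# FP16 = 2, FP8 = 1, INT4 = 0.5. Weights only -- KV cache and
# activation memory are additional.

def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight memory in GB for a model of the given parameter count."""
    return params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

for fmt, bpp in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"70B @ {fmt}: {weight_gb(70, bpp):.0f} GB")
# 70B @ FP16: 140 GB
# 70B @ FP8: 70 GB
# 70B @ INT4: 35 GB
```

Halving the bytes per weight not only lets a model fit on fewer GPUs, it frees VRAM for KV cache, which (as the next layer shows) is what sets your concurrency ceiling.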


Real example: DeepSeek was trained and served entirely in FP8, and it's still one of the leading models in benchmark rankings. Quantization is no longer a compromise; it's standard practice.

Smaller models

Before accepting a large model's resource footprint, always test a smaller alternative. Recent models like Nemotron-3 30B use a Mixture-of-Experts (MoE) architecture similar to DeepSeek and Qwen, and often punch well above their parameter count.

Testing tip: Build a benchmark spreadsheet with 40–50 hard, domain-specific questions. Test full-size vs quantized vs smaller models on the same set. You'll often be surprised by the results.
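The benchmark-spreadsheet idea can be turned into a tiny harness. This is a sketch: `ask` is a stub you'd wire to your inference endpoint, and keyword matching is a crude stand-in for human or LLM-as-judge grading.

```python
# Minimal harness for the benchmark-spreadsheet tip: run the same hard,
# domain-specific question set against each model variant and tally a score.
# `ask` is a placeholder for a call to your inference endpoint; scoring by
# expected keywords is a deliberate simplification.
from typing import Callable

def score_model(ask: Callable[[str], str],
                questions: list[tuple[str, list[str]]]) -> float:
    """Fraction of questions whose answer contains all expected keywords."""
    hits = 0
    for question, keywords in questions:
        answer = ask(question).lower()
        if all(k.lower() in answer for k in keywords):
            hits += 1
    return hits / len(questions)
```

Run the same 40–50 questions through the full-size, quantized, and smaller candidates, and compare scores side by side before committing to a footprint.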


Layer 2: The inference engine

Once your model is right-sized, the next question is: are you getting all the concurrency your GPU can deliver?


The concurrency limit of your deployment is defined by how much VRAM is left after the model loads. This is why the world can't infinitely scale AI: every GPU worker has a hard ceiling on simultaneous requests.
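That ceiling is easy to estimate. A rough sketch with illustrative numbers (real KV-cache size per request depends on model architecture, context length, and cache precision):

```python
# Rough concurrency ceiling: whatever VRAM is left after weights and
# runtime overhead must hold the KV cache of every in-flight request.
# All figures are illustrative assumptions, not measured values.

def max_concurrency(gpu_gb: float, weights_gb: float,
                    kv_gb_per_request: float, overhead_gb: float = 4.0) -> int:
    """Hard ceiling on simultaneous requests for one GPU worker."""
    free = gpu_gb - weights_gb - overhead_gb
    return max(0, int(free // kv_gb_per_request))

# 80 GB GPU, 35 GB of quantized weights, ~0.5 GB of KV cache per request:
print(max_concurrency(80, 35, 0.5))  # 82
```

Note how the quantization savings from Layer 1 feed directly into this number: every GB shaved off the weights becomes KV-cache room for more concurrent requests.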


Key runtime techniques

  1. Continuous batching
    Instead of waiting for a full batch, process requests as they arrive. This dramatically improves GPU utilization and reduces average latency.

  2. Priority queuing
    Route paid or premium users to a priority queue so they always get lower latency without over-provisioning for everyone.

  3. Disaggregated prefill / decode
    When input prompts are long but responses are short (e.g. RAG over documents), separate the "understanding" phase (prefill) from the "generation" phase (decode) onto different workers. NVIDIA Dynamo enables this today.
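Technique 2 above can be sketched with a standard heap. This is a minimal illustration, not tied to any particular inference engine: premium requests always dequeue before free-tier ones, FIFO within each tier.

```python
# Sketch of priority queuing: premium requests are dequeued before
# free-tier ones; a monotonically increasing sequence number keeps
# FIFO order within each tier and breaks ties before requests compare.
import heapq
import itertools

class PriorityRequestQueue:
    PREMIUM, FREE = 0, 1  # lower value = served first

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def put(self, request, tier: int) -> None:
        heapq.heappush(self._heap, (tier, next(self._seq), request))

    def get(self):
        return heapq.heappop(self._heap)[2]
```

A premium request enqueued after a backlog of free-tier requests still comes out first, which is how you guarantee paid users lower latency without over-provisioning the whole fleet.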


Case studies from the field

Case study 1 - Auto-scaling boot time

Problem: A Llama 3 INT4 model took 2 minutes to boot on a new worker during traffic spikes, by which time the spike was already over.

Solution: Saved the full GPU process context and restored from it on startup.

Result: Boot time dropped from 2 minutes → 20 seconds.


Case study 2 - Gemma 27B GPU scaling

The team tested Gemma 27B across different GPU configurations to find the optimal cost/latency tradeoff:

(Chart: Gemma 27B performance across GPU configurations)

"Knowing these numbers makes a lot of difference; it's a conscious choice you're making, not just guessing."


Case study 3 - 10 billion tokens per day

Setup: 48× H200 GPUs, targeting 10B tokens/day in throughput mode.

Finding: Workers were only receiving ~10% of their capacity in requests; the bottleneck was entirely on the client side, not the GPUs.

Fix: Added a client-side request queue and monitoring to ensure GPU workers stayed saturated.
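The fix in this case study boils down to pacing. A minimal sketch of a client-side queue that releases requests at a steady rate (the `send` callable and the rate limit are placeholders for your own client and measured worker capacity):

```python
# Client-side pacing: release queued requests at a fixed rate that
# matches measured worker capacity, instead of firing bursts at the
# inference engine. `send` is a placeholder for your request function.
import time
from collections import deque

def drain_paced(requests, send, max_rps: float) -> None:
    """Send queued requests at no more than max_rps requests/second."""
    queue = deque(requests)
    interval = 1.0 / max_rps
    while queue:
        send(queue.popleft())
        if queue:
            time.sleep(interval)
```

Pair this with per-worker request monitoring and the GPUs stay saturated instead of idling between client bursts.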


Layer 3: Network and routing

When you're targeting sub-300ms latency, every hop in the request path matters.


Prefix / KV-cache routing

Instead of round-robin routing, smart load balancers check whether a GPU worker has already processed a similar prompt prefix. If it has, it still holds the KV-cache for that prefix, and the request costs significantly less compute.

Important caveat: Prefix caching only works when prompts share a common prefix. If your system prompt or document always comes first and stays constant, you benefit. If prompts are highly unique or variables appear at the start, caching won't help much.
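A toy version of prefix-aware routing makes the idea concrete. This is a deliberately simplified sketch: production routers (e.g. in vLLM-based stacks) match on hashed token-block prefixes rather than raw strings.

```python
# Toy prefix-aware router: requests whose prompt starts with a known
# prefix (e.g. a constant system prompt) go to the worker that already
# holds that prefix's KV cache; everything else falls back to round-robin.
import itertools

class PrefixRouter:
    def __init__(self, workers):
        self.workers = workers
        self.prefix_owner = {}            # known prefix -> worker holding its cache
        self._rr = itertools.cycle(workers)

    def remember(self, prefix: str, worker) -> None:
        """Record that `worker` has the KV cache for `prefix`."""
        self.prefix_owner[prefix] = worker

    def route(self, prompt: str):
        for prefix, worker in self.prefix_owner.items():
            if prompt.startswith(prefix):
                return worker             # cache hit: much cheaper prefill
        return next(self._rr)             # cache miss: plain round-robin
```

This also shows the caveat in code: if prompts don't share a stable leading prefix, `route` degenerates to round-robin and the cache buys you nothing.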


Latency vs throughput: choosing your GPU split

(Chart: latency-optimized vs throughput-optimized GPU splits)


Don't forget the client side

Server optimization only helps if the client is feeding the server correctly. A common mistake: sophisticated GPU setups sit underutilized because requests arrive in bursts or hit only one worker.

  1. Add a client-side queue
    Buffer requests on the client and release them at a steady rate that matches worker capacity. Don't rely on the inference engine to absorb spikes — it will take everything and choke other workers.

  2. Monitor request distribution
    Track which workers are receiving requests. Imbalanced load (80% to worker 1, 20% to worker 2) wastes capacity and inflates costs.

  3. Use gRPC for real-time workloads
    For STT/TTS or streaming use cases, gRPC outperforms REST: lower overhead and native streaming support.
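Point 2 above, monitoring request distribution, can start as a few lines of counting. A minimal sketch (the 20% tolerance is an arbitrary assumption; tune it to your fleet):

```python
# Simple request-distribution check: count how many requests each worker
# received and flag any worker whose share deviates from its fair share
# by more than a tolerance. Tolerance of 0.2 is an illustrative default.
from collections import Counter

def is_balanced(assignments, workers, tolerance: float = 0.2) -> bool:
    """True if every worker's request share is within tolerance of fair."""
    counts = Counter(assignments)
    total = len(assignments)
    fair_share = 1 / len(workers)
    return all(abs(counts[w] / total - fair_share) <= tolerance
               for w in workers)
```

An 80/20 split across two workers fails this check, which is exactly the "wasted capacity" situation described above; a worker that received zero requests fails it too, since it appears in `workers` with a count of zero.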


The bigger picture: shared AI infrastructure

One underrated optimization: consolidate model usage across teams. If your DevOps team, product team, and data team are all calling separate AI endpoints with different prompts on different infrastructure, you're paying a steep inefficiency tax.

"A lot of teams are trying to build an AI factory within their company, one model infrastructure, one API, multiple use cases. That's where economics starts to make sense."

Getting multiple internal teams onto a shared model layer is hard organizationally, but the TCO improvement can be significant, especially for enterprises where a dedicated SaaS model isn't viable due to data privacy requirements.


Quick reference: where to start

Decision guide — what to optimize first

| Your situation | First move |
| --- | --- |
| Using a commercial API (OpenAI, Claude, etc.) | Try an open-source model on your eval set first |
| Running a 70B+ model on multiple GPUs | Quantize to FP8; likely zero quality loss |
| Latency is high but GPU utilization is low | Check client-side request pacing and queue depth |
| Throughput plateau despite adding GPUs | Profile per-worker concurrency; the model may not be saturated |
| Long prompts + short responses (RAG) | Explore a disaggregated prefill/decode setup |
| Auto-scaling with cold-start problems | Checkpoint GPU process state for faster restores |