Get ₹2,000 free credits to test your AI workloads
Sign up and complete ID verification to unlock free credits. Deploy on NVIDIA H200, H100, and L40S GPUs—no commitment required.
A Practical Guide to AI Infrastructure Stability
Lessons from building NarrateAI — a multi-VM AI pipeline running Parakeet ASR, Qwen2.5-72B, and Nemotron on E2E Networks GPU cloud
The Setup
We built NarrateAI — an internal AI platform that takes a product demo video and generates 13 types of content: technical blogs, LinkedIn posts, API references, battlecards, email announcements, and more.
The architecture spans three GPU VMs:
- VM1 (4x A30): FastAPI backend + Celery workers + Next.js frontend + Qwen2.5-72B for content intelligence
- VM2 (4x A30): Qwen2.5-72B dedicated for parallel artifact generation
- L4 VM (1x L4): NVIDIA Parakeet TDT 1.1B for audio transcription
Everything was working. Then we ran apt upgrade.
What followed was a 6-hour debugging session that taught us more about GPU infrastructure stability than any documentation ever could.
Problem 1: Kernel Upgrades Break NVIDIA Drivers
What Happened
We ran a routine apt upgrade, which pulled in a new kernel:
- From: 5.15.0-94-generic
- To: 5.15.0-171-generic
After reboot:
```
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
```
The Diagnosis
```shell
uname -r
# 5.15.0-171-generic
modprobe nvidia
# FATAL: Module nvidia not found
```
The Fix
```shell
# Boot back into the known-good kernel once
grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-94-generic"
reboot

# Then pin the kernel packages so apt can't replace them again
apt-mark hold linux-image-$(uname -r) linux-headers-$(uname -r)
apt-mark hold linux-image-generic linux-headers-generic
```
The Rule
Never allow automatic kernel upgrades on GPU servers.
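Before trusting a box after any reboot, it is worth confirming that the driver module actually exists for the running kernel. A sketch assuming a DKMS-packaged driver; `kernel_has_nvidia` is a hypothetical helper that parses the usual `dkms status` output format:

```shell
# kernel_has_nvidia KREL: read `dkms status` on stdin and succeed only if
# an nvidia module is built and installed for kernel release KREL.
# Assumes the common "nvidia/<version>, <kernel>, <arch>: installed" format.
kernel_has_nvidia() {
    grep -q "^nvidia.*, ${1},.*: installed"
}

# Typical use after a reboot, before starting any GPU services:
#   dkms status | kernel_has_nvidia "$(uname -r)" || sudo dkms autoinstall
```

Running this in a boot-time script turns the silent "driver missing" state into an explicit failure you can alert on.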
Problem 2: needrestart Dialog Blocks Everything
What Happened
During a routine apt install, needrestart's full-screen ncurses dialog asking which services to restart froze our terminal sessions.
The Fix
```shell
mkdir -p /etc/needrestart/conf.d/
echo '$nrconf{restart} = "a";' > /etc/needrestart/conf.d/autorestart.conf
echo '$nrconf{kernelhints} = 0;' >> /etc/needrestart/conf.d/autorestart.conf
```
Kill a stuck process:
```shell
pkill -f needrestart
```
The Rule
Always disable interactive system prompts on production GPU servers.
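needrestart is not the only thing that can prompt; debconf and dpkg config-file questions can also block an unattended session. A minimal sketch for Debian/Ubuntu, with a placeholder package name:

```shell
# DEBIAN_FRONTEND=noninteractive silences debconf prompts;
# --force-confold keeps existing config files instead of asking.
export DEBIAN_FRONTEND=noninteractive
apt-get install -y \
    -o Dpkg::Options::="--force-confold" \
    some-package   # placeholder: substitute the real package
```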
Problem 3: Docker CE vs docker.io Conflict
Root Cause
Two Docker packagings, docker-ce from Docker's upstream repository and docker.io from Ubuntu's archive, were installed simultaneously, leaving conflicting daemons plus a stale PID file and socket behind.
The Fix
```shell
apt remove -y docker-ce docker-ce-rootless-extras docker-buildx-plugin
apt install -y docker.io
rm -f /run/docker.pid /run/docker.sock
systemctl start docker
```
The Rule
Use only one Docker installation method.
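A quick way to catch the conflict before it bites is to list the installed Docker engine packages. `list_docker_pkgs` is a hypothetical helper that just filters `dpkg -l` output; more than one line of output means two installation methods are present:

```shell
# list_docker_pkgs: read `dpkg -l` output on stdin and print any Docker
# engine package in state "ii" (installed).
list_docker_pkgs() {
    awk '$1 == "ii" && ($2 == "docker-ce" || $2 == "docker.io") { print $2 }'
}

# Usage: dpkg -l | list_docker_pkgs
```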
Problem 4: No Startup Automation
The Fix — Ordered Startup Script
```shell
#!/bin/bash
sleep 10
systemctl start postgresql redis

# Start vLLM first; everything else depends on it
screen -dmS vllm bash -c "python3 -m vllm.entrypoints.openai.api_server \
    --model /opt/models/qwen2.5-72b --tensor-parallel-size 4 \
    --port 8001 --served-model-name narrateai-llm"

# Wait up to 3 minutes for the model endpoint to come up
for i in $(seq 1 36); do
    sleep 5
    curl -s http://localhost:8001/v1/models > /dev/null && break
done

screen -dmS api bash -c "uvicorn app.main:app --host 0.0.0.0 --port 8000"
screen -dmS worker bash -c "celery -A app.workers.pipeline:celery_app worker"
screen -dmS frontend bash -c "npm start"
```
The Rule
Use health checks and enforce startup order.
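The curl-and-break loop above can be factored into a reusable helper so every dependent service waits the same way. A sketch, with URL and timeout as parameters:

```shell
# wait_for_http URL [TIMEOUT_SECS]: poll until the endpoint answers
# successfully or the timeout elapses. Returns 0 on success, 1 on timeout.
wait_for_http() {
    local url=$1 timeout=${2:-180} waited=0
    until curl -sf "$url" > /dev/null 2>&1; do
        sleep 2
        waited=$((waited + 2))
        [ "$waited" -ge "$timeout" ] && return 1
    done
    return 0
}

# e.g. wait_for_http http://localhost:8001/v1/models 180 || exit 1
```

Failing fast with a non-zero exit code also lets systemd or a wrapper script retry the whole startup sequence instead of launching workers against a dead model endpoint.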
Problem 5: vLLM OOM Issues
The Fix
```shell
# See how much VRAM is in use and which processes hold it
nvidia-smi | grep MiB
nvidia-smi | grep python
# Kill the stale process still holding GPU memory
kill -9 <PID>
```
The Rule
70B models require dedicated GPU nodes.
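A back-of-envelope check before deploying: weights alone take roughly parameters times bytes per parameter, before any KV cache or activation overhead. A sketch (these are planning estimates, not vLLM's actual allocator numbers):

```shell
# est_weights_gb PARAMS_IN_BILLIONS BYTES_PER_PARAM
# 1B params at 1 byte each is ~1 GB, so GB ~= params_b * bytes_per_param.
est_weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b }'
}

est_weights_gb 72 2   # FP16: ~144 GB, more than 4x A30 (96 GB) can hold
est_weights_gb 72 0.5 # INT4 quantized: ~36 GB, before KV cache
```

The gap between those two numbers is why quantization and dedicated nodes both matter for 70B-class models.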
Problem 6: asyncpg JSONB Error
Error
```
invalid input for query argument
```
Fix
```python
import json

# asyncpg does not automatically encode Python lists/dicts to JSONB
params = {"audience": json.dumps(audience_list)}
```
Always serialize JSON manually for asyncpg.
Problem 7: Next.js Turbopack Issues
The Fix
```shell
pkill -f "next-server"
sleep 3
npm start
```
Optional downgrade:
```shell
npm install next@15.2.2 --save
```
The Rule
Kill old processes before redeployments.
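This rule generalizes beyond Next.js: anything that holds a port should be killed by port, not by guessing process names. A sketch assuming `lsof` is available; `free_port` is a hypothetical helper:

```shell
# free_port PORT: terminate whatever is listening on PORT,
# SIGTERM first, then SIGKILL after a grace period.
free_port() {
    local pids
    pids=$(lsof -ti tcp:"$1" 2>/dev/null)
    [ -n "$pids" ] || return 0
    kill $pids 2>/dev/null
    sleep 3
    kill -9 $pids 2>/dev/null || true
}

# e.g. free_port 3000 && npm start
```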
GPU Server Setup Checklist
```shell
# Pin the running kernel
apt-mark hold linux-image-$(uname -r)
# Pre-create the needrestart config directory
mkdir -p /etc/needrestart/conf.d/
# Verify the driver responds
nvidia-smi
# Refresh the boot menu after kernel changes
update-grub
# Autostart your services on boot
systemctl enable your-service
```
Troubleshooting Decision Tree
- Can you SSH?
- Does nvidia-smi work?
- Is the service running?
- Is the port open?
- Check dependencies: PostgreSQL, Redis, model endpoints
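The tree above can be run as a one-shot triage script, one check per question. A sketch; the service names and ports match this guide's stack (FastAPI serves `/docs` by default, so it doubles as a liveness URL):

```shell
# step LABEL CMD...: run CMD silently, print the label with OK or FAIL.
step() {
    local label=$1; shift
    if "$@" > /dev/null 2>&1; then
        printf '%-35s OK\n' "$label"
    else
        printf '%-35s FAIL\n' "$label"
    fi
}

step "GPU driver (nvidia-smi)"  nvidia-smi
step "vLLM endpoint (:8001)"    curl -sf http://localhost:8001/v1/models
step "Backend API (:8000)"      curl -sf http://localhost:8000/docs
step "PostgreSQL"               pg_isready
step "Redis"                    redis-cli ping
```

Reading the output top to bottom mirrors the tree: the first FAIL line is where to start digging.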
Key Takeaways
- Pin your kernel
- Disable needrestart
- Write startup scripts
- Avoid Docker conflicts
- Plan GPU memory
- Serialize JSON properly
- Kill stale processes
About NarrateAI
NarrateAI is an internal AI platform at E2E Networks that transforms product demo videos into a full content marketing package.
Built on E2E Networks GPU Cloud — India's sovereign AI infrastructure.


