How We Learned to Stop Fighting Our GPU Servers

E2E Networks

Content Team @ E2E Networks

March 27, 2026 · 4 min read
Free Credits Inside

Get ₹2,000 free credits to test your AI workloads

Sign up and complete ID verification to unlock free credits. Deploy on NVIDIA H200, H100, and L40S GPUs—no commitment required.

A Practical Guide to AI Infrastructure Stability

Lessons from building NarrateAI — a multi-VM AI pipeline running Parakeet ASR, Qwen2.5-72B, and Nemotron on E2E Networks GPU cloud


The Setup

We built NarrateAI — an internal AI platform that takes a product demo video and generates 13 types of content: technical blogs, LinkedIn posts, API references, battlecards, email announcements, and more.

The architecture spans three GPU VMs:

  1. VM1 (4x A30): FastAPI backend + Celery workers + Next.js frontend + Qwen2.5-72B for content intelligence
  2. VM2 (4x A30): Qwen2.5-72B dedicated for parallel artifact generation
  3. L4 VM (1x L4): NVIDIA Parakeet TDT 1.1B for audio transcription

Everything was working. Then we ran apt upgrade.

What followed was a 6-hour debugging session that taught us more about GPU infrastructure stability than any documentation ever could.



Problem 1: Kernel Upgrades Break NVIDIA Drivers

What Happened

We ran a routine apt install which upgraded the kernel:

  • From: 5.15.0-94-generic
  • To: 5.15.0-171-generic

After reboot:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.

The Diagnosis

```bash
uname -r
# 5.15.0-171-generic

modprobe nvidia
# FATAL: Module nvidia not found
```

The Fix

```bash
# Boot back into the kernel the NVIDIA driver was built against
# (grub-reboot requires GRUB_DEFAULT=saved in /etc/default/grub)
grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-94-generic"
reboot

# After reboot: pin the working kernel so apt cannot replace it
apt-mark hold linux-image-$(uname -r) linux-headers-$(uname -r)
apt-mark hold linux-image-generic linux-headers-generic
```

The Rule

Never allow automatic kernel upgrades on GPU servers.
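apt-mark hold covers manual upgrades; if unattended-upgrades is also enabled, blacklist kernel packages there too. A sketch of the drop-in — the file name is our choice, and the entries are matched as regex prefixes:

```
# /etc/apt/apt.conf.d/51-kernel-blacklist  (hypothetical drop-in)
Unattended-Upgrade::Package-Blacklist {
    "linux-image-";
    "linux-headers-";
    "linux-generic";
};
```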


Problem 2: needrestart Dialog Blocks Everything

What Happened

needrestart's full-screen ncurses dialog ("Which services should be restarted?") froze our terminal sessions during apt install.

The Fix

```bash
mkdir -p /etc/needrestart/conf.d/
echo '$nrconf{restart} = "a";' > /etc/needrestart/conf.d/autorestart.conf
echo '$nrconf{kernelhints} = 0;' >> /etc/needrestart/conf.d/autorestart.conf
```

Kill stuck process:

```bash
pkill -f needrestart
```

The Rule

Always disable interactive system prompts on production GPU servers.
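If you'd rather not ship a conf.d file, the same behavior can be forced per invocation — a sketch, assuming a Debian/Ubuntu host:

```bash
# NEEDRESTART_MODE=a = restart services automatically, never prompt;
# the noninteractive frontend suppresses dpkg's own dialogs too
sudo DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a apt install -y <package>
```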


Problem 3: Docker CE vs docker.io Conflict

Root Cause

Two Docker distributions — docker-ce from Docker's upstream repository and docker.io from Ubuntu's archive — were installed simultaneously, leaving the daemon with a stale PID file and socket it could not reclaim.

The Fix

```bash
apt remove -y docker-ce docker-ce-rootless-extras docker-buildx-plugin
apt install -y docker.io
rm -f /run/docker.pid /run/docker.sock
systemctl start docker
```

The Rule

Use only one Docker installation method.


Problem 4: No Startup Automation

The Fix — Ordered Startup Script

```bash
#!/bin/bash
sleep 10
systemctl start postgresql redis

# Start vLLM first; everything downstream depends on it
screen -dmS vllm bash -c "python3 -m vllm.entrypoints.openai.api_server \
  --model /opt/models/qwen2.5-72b --tensor-parallel-size 4 \
  --port 8001 --served-model-name narrateai-llm"

# Wait up to 3 minutes for the model endpoint to come up
for i in $(seq 1 36); do
  sleep 5
  curl -s http://localhost:8001/v1/models > /dev/null && break
done

screen -dmS api bash -c "uvicorn app.main:app --host 0.0.0.0 --port 8000"
screen -dmS worker bash -c "celery -A app.workers.pipeline:celery_app worker"
screen -dmS frontend bash -c "npm start"
```

The Rule

Use health checks and enforce startup order.
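screen sessions work, but systemd can encode the same ordering declaratively and restart a crashed service for free. A sketch of a unit for the vLLM step — the unit name, paths, and the redis-server.service dependency are our assumptions, not the actual NarrateAI config:

```ini
# /etc/systemd/system/vllm.service  (hypothetical)
[Unit]
Description=vLLM OpenAI-compatible server
After=network-online.target postgresql.service redis-server.service
Wants=network-online.target

[Service]
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model /opt/models/qwen2.5-72b --tensor-parallel-size 4 \
  --port 8001 --served-model-name narrateai-llm
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Dependents can then declare After=vllm.service instead of carrying their own sleep-and-poll loops.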


Problem 5: vLLM OOM Issues

The Fix

```bash
# See how much GPU memory is in use, and by which processes
nvidia-smi | grep MiB
nvidia-smi | grep python

# Free the memory by killing the stale process
kill -9 <PID>
```

The Rule

70B models require dedicated GPU nodes.
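If the model must share a node anyway, vLLM's memory knobs are the next lever. A sketch — the flag values here are illustrative, not what we ran:

```bash
# Cap vLLM's share of each GPU and the context length it pre-allocates for.
# Tune both values for your cards and workload.
python3 -m vllm.entrypoints.openai.api_server \
  --model /opt/models/qwen2.5-72b \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```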


Problem 6: asyncpg JSONB Error

Error

invalid input for query argument

Fix

```python
import json

# asyncpg does not auto-encode Python dicts/lists for JSONB columns;
# serialize to a JSON string before binding the parameter
params = {"audience": json.dumps(audience_list)}
```

The Rule

Always serialize JSON manually for asyncpg.


Problem 7: Next.js Turbopack Issues

The Fix

```bash
pkill -f "next-server"
sleep 3
npm start
```

Optional downgrade:

```bash
npm install next@15.2.2 --save
```

The Rule

Kill old processes before redeployments.


GPU Server Setup Checklist

```bash
apt-mark hold linux-image-$(uname -r)   # pin the working kernel
mkdir -p /etc/needrestart/conf.d/       # disable interactive prompts
nvidia-smi                              # verify the driver before deploying
update-grub                             # make the boot entry stick
systemctl enable your-service           # survive reboots
```

Troubleshooting Decision Tree

  1. Can you SSH?
  2. Does nvidia-smi work?
  3. Is service running?
  4. Is port open?
  5. Check dependencies: PostgreSQL, Redis, model endpoints
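The same tree can be scripted so the first failing layer is obvious at a glance. A minimal sketch — the probe commands in the comments assume our stack's service names and ports:

```bash
# Run probes in order; the first FAIL is where to start debugging.
probe() {
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf 'OK %s\n' "$label"
  else
    printf 'FAIL %s\n' "$label"
    return 1
  fi
}

# Usage on the GPU host (commands assume our stack):
#   probe "GPU driver"  nvidia-smi
#   probe "API port"    curl -sf http://localhost:8000/docs
#   probe "PostgreSQL"  pg_isready
#   probe "Redis"       redis-cli ping
```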

Key Takeaways

  • Pin your kernel
  • Disable needrestart
  • Write startup scripts
  • Avoid Docker conflicts
  • Plan GPU memory
  • Serialize JSON properly
  • Kill stale processes

About NarrateAI

NarrateAI is an internal AI platform at E2E Networks that transforms product demo videos into a full content marketing package.

Built on E2E Networks GPU Cloud — India's sovereign AI infrastructure.
