How We Learned to Stop Fighting Our GPU Servers

E2E Networks

Content Team @ E2E Networks

March 27, 2026 · 4 min read
Free Credits Inside

Get ₹2,000 free credits to test your AI workloads

Sign up and complete ID verification to unlock free credits. Deploy on NVIDIA H200, H100, and L40S GPUs—no commitment required.

A Practical Guide to AI Infrastructure Stability

Lessons from building NarrateAI — a multi-VM AI pipeline running Parakeet ASR, Qwen2.5-72B, and Nemotron on E2E Networks GPU cloud


The Setup

We built NarrateAI — an internal AI platform that takes a product demo video and generates 13 types of content: technical blogs, LinkedIn posts, API references, battlecards, email announcements, and more.

The architecture spans three GPU VMs:

  1. VM1 (4x A30): FastAPI backend + Celery workers + Next.js frontend + Qwen2.5-72B for content intelligence
  2. VM2 (4x A30): Qwen2.5-72B dedicated for parallel artifact generation
  3. L4 VM (1x L4): NVIDIA Parakeet TDT 1.1B for audio transcription

Everything was working. Then we ran apt upgrade.

What followed was a 6-hour debugging session that taught us more about GPU infrastructure stability than any documentation ever could.



Problem 1: Kernel Upgrades Break NVIDIA Drivers

What Happened

We ran a routine apt install which upgraded the kernel:

  • From: 5.15.0-94-generic
  • To: 5.15.0-171-generic

After reboot:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.

The Diagnosis

```bash
uname -r
# 5.15.0-171-generic

modprobe nvidia
# FATAL: Module nvidia not found
```

The Fix

```bash
# Boot back into the kernel the NVIDIA driver was built against
# (grub-reboot requires GRUB_DEFAULT=saved in /etc/default/grub)
grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-94-generic"
reboot

# After reboot: pin the working kernel so apt cannot replace it
apt-mark hold linux-image-$(uname -r) linux-headers-$(uname -r)
apt-mark hold linux-image-generic linux-headers-generic
```

The Rule

Never allow automatic kernel upgrades on GPU servers.
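apt-mark hold covers manual upgrades; if unattended-upgrades is also enabled, blacklist kernel packages there too. A sketch of the drop-in — the file name is our choice, and the entries are matched as regex prefixes:

```
# /etc/apt/apt.conf.d/51-kernel-blacklist  (hypothetical drop-in)
Unattended-Upgrade::Package-Blacklist {
    "linux-image-";
    "linux-headers-";
    "linux-generic";
};
```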


Problem 2: needrestart Dialog Blocks Everything

What Happened

needrestart's full-screen ncurses dialog ("Which services should be restarted?") froze our terminal sessions during apt install.

The Fix

```bash
mkdir -p /etc/needrestart/conf.d/
echo '$nrconf{restart} = "a";' > /etc/needrestart/conf.d/autorestart.conf
echo '$nrconf{kernelhints} = 0;' >> /etc/needrestart/conf.d/autorestart.conf
```

Kill stuck process:

```bash
pkill -f needrestart
```

The Rule

Always disable interactive system prompts on production GPU servers.
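If you'd rather not ship a conf.d file, the same behavior can be forced per invocation — a sketch, assuming a Debian/Ubuntu host:

```bash
# NEEDRESTART_MODE=a = restart services automatically, never prompt;
# the noninteractive frontend suppresses dpkg's own dialogs too
sudo DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a apt install -y <package>
```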


Problem 3: Docker CE vs docker.io Conflict

Root Cause

Two Docker distributions — docker-ce from Docker's upstream repository and docker.io from Ubuntu's archive — were installed simultaneously, leaving the daemon with a stale PID file and socket it could not reclaim.

The Fix

```bash
apt remove -y docker-ce docker-ce-rootless-extras docker-buildx-plugin
apt install -y docker.io
rm -f /run/docker.pid /run/docker.sock
systemctl start docker
```

The Rule

Use only one Docker installation method.


Problem 4: No Startup Automation

The Fix — Ordered Startup Script

```bash
#!/bin/bash
sleep 10
systemctl start postgresql redis

# Start vLLM first; everything downstream depends on it
screen -dmS vllm bash -c "python3 -m vllm.entrypoints.openai.api_server \
  --model /opt/models/qwen2.5-72b --tensor-parallel-size 4 \
  --port 8001 --served-model-name narrateai-llm"

# Wait up to 3 minutes for the model endpoint to come up
for i in $(seq 1 36); do
  sleep 5
  curl -s http://localhost:8001/v1/models > /dev/null && break
done

screen -dmS api bash -c "uvicorn app.main:app --host 0.0.0.0 --port 8000"
screen -dmS worker bash -c "celery -A app.workers.pipeline:celery_app worker"
screen -dmS frontend bash -c "npm start"
```

The Rule

Use health checks and enforce startup order.
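screen sessions work, but systemd can encode the same ordering declaratively and restart a crashed service for free. A sketch of a unit for the vLLM step — the unit name, paths, and the redis-server.service dependency are our assumptions, not the actual NarrateAI config:

```ini
# /etc/systemd/system/vllm.service  (hypothetical)
[Unit]
Description=vLLM OpenAI-compatible server
After=network-online.target postgresql.service redis-server.service
Wants=network-online.target

[Service]
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model /opt/models/qwen2.5-72b --tensor-parallel-size 4 \
  --port 8001 --served-model-name narrateai-llm
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Dependents can then declare After=vllm.service instead of carrying their own sleep-and-poll loops.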


Problem 5: vLLM OOM Issues

The Fix

```bash
# See how much GPU memory is in use, and by which processes
nvidia-smi | grep MiB
nvidia-smi | grep python

# Free the memory by killing the stale process
kill -9 <PID>
```

The Rule

70B models require dedicated GPU nodes.
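If the model must share a node anyway, vLLM's memory knobs are the next lever. A sketch — the flag values here are illustrative, not what we ran:

```bash
# Cap vLLM's share of each GPU and the context length it pre-allocates for.
# Tune both values for your cards and workload.
python3 -m vllm.entrypoints.openai.api_server \
  --model /opt/models/qwen2.5-72b \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```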


Problem 6: asyncpg JSONB Error

Error

invalid input for query argument

Fix

```python
import json

# asyncpg does not auto-encode Python dicts/lists for JSONB columns;
# serialize to a JSON string before binding the parameter
params = {"audience": json.dumps(audience_list)}
```

The Rule

Always serialize JSON manually for asyncpg.


Problem 7: Next.js Turbopack Issues

The Fix

```bash
pkill -f "next-server"
sleep 3
npm start
```

Optional downgrade:

```bash
npm install next@15.2.2 --save
```

The Rule

Kill old processes before redeployments.


GPU Server Setup Checklist

```bash
apt-mark hold linux-image-$(uname -r)   # pin the working kernel
mkdir -p /etc/needrestart/conf.d/       # disable interactive prompts
nvidia-smi                              # verify the driver before deploying
update-grub                             # make the boot entry stick
systemctl enable your-service           # survive reboots
```

Troubleshooting Decision Tree

  1. Can you SSH?
  2. Does nvidia-smi work?
  3. Is service running?
  4. Is port open?
  5. Check dependencies: PostgreSQL, Redis, model endpoints
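The same tree can be scripted so the first failing layer is obvious at a glance. A minimal sketch — the probe commands in the comments assume our stack's service names and ports:

```bash
# Run probes in order; the first FAIL is where to start debugging.
probe() {
  label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf 'OK %s\n' "$label"
  else
    printf 'FAIL %s\n' "$label"
    return 1
  fi
}

# Usage on the GPU host (commands assume our stack):
#   probe "GPU driver"  nvidia-smi
#   probe "API port"    curl -sf http://localhost:8000/docs
#   probe "PostgreSQL"  pg_isready
#   probe "Redis"       redis-cli ping
```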

Key Takeaways

  • Pin your kernel
  • Disable needrestart
  • Write startup scripts
  • Avoid Docker conflicts
  • Plan GPU memory
  • Serialize JSON properly
  • Kill stale processes

About NarrateAI

NarrateAI is an internal AI platform at E2E Networks that transforms product demo videos into a full content marketing package.

Built on E2E Networks GPU Cloud — India's sovereign AI infrastructure.
