Accelerated Computing Engineer
2–4 years | Kancheepuram, Chennai
We are looking for an Accelerated Computing Engineer to join our Cloud Platform team and take ownership of GPU and AI infrastructure operations for enterprise and research workloads on the E2E Cloud platform. In this role, you will be responsible for building, managing, and troubleshooting GPU clusters (for training and inference), as well as directly engaging with customers to help them deploy and optimize their AI workloads.
Roles & Responsibilities:
1. Cluster and GPU Management:
- Launch, validate, and maintain GPU-based AI/ML training clusters (8xH100, 8xH200, 32xH200, 64xH200, up to 1024 H200s).
- Verify all cluster nodes have InfiniBand enabled and GPUs correctly assigned (no Ethernet fallback).
- Ensure Slurm deployments come up within a minute with all workers ready (sinfo shows all nodes active).
- Validate DGX Workbench runs successfully for:
  - Llama3-8B on 8xH100 / 8xH200 clusters
  - Llama3-70B on 32xH200 / 64xH200 clusters
- Monitor GPU health using standard monitoring tools and validate performance benchmarks.
- Maintain cluster reliability: all training and inference nodes should remain up and restartable without failure.
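The GPU health monitoring described above can be sketched as a small parser over nvidia-smi's CSV query output. The query fields and thresholds below are illustrative assumptions, not E2E Cloud policy:

```python
import csv
import io

# Hypothetical health check: parses the CSV output of
#   nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu,ecc.errors.uncorrected.volatile.total \
#              --format=csv,noheader,nounits
# Thresholds are assumed values for illustration.
TEMP_LIMIT_C = 85          # flag GPUs running hotter than this
ECC_ERROR_LIMIT = 0        # any uncorrected ECC error is a problem

def unhealthy_gpus(nvidia_smi_csv: str) -> list[int]:
    """Return indices of GPUs that fail the (assumed) health thresholds."""
    bad = []
    for row in csv.reader(io.StringIO(nvidia_smi_csv)):
        index, temp, util, ecc = (field.strip() for field in row)
        if int(temp) > TEMP_LIMIT_C or int(ecc) > ECC_ERROR_LIMIT:
            bad.append(int(index))
    return bad

# Sample output for a node (values invented for illustration):
sample = """0, 61, 98, 0
1, 63, 97, 0
2, 91, 99, 0
3, 60, 96, 2"""
print(unhealthy_gpus(sample))  # → [2, 3] (GPU 2 too hot, GPU 3 has ECC errors)
```

In practice a check like this would run on every node via cron or a Prometheus exporter, with failing indices pushed to alerting.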
2. Inference and Endpoint Operations:
- Launch and monitor vLLM inference endpoints (e.g., Llama 70B), ensuring:
  - First startup within 10 minutes
  - Restart within 1 minute
  - Autoscaling brings new workers up within 3 minutes
  - Inference endpoints remain continuously reachable and 100% ready
- Troubleshoot and stabilize stateful workloads, notebooks, and AI services.
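The readiness targets above (10-minute first start, 1-minute restart, 3-minute autoscale) all reduce to a wait-until-ready loop with a deadline. A minimal sketch, with the probe injected so the same loop could wrap an HTTP GET against a vLLM health endpoint; all names and timings here are assumptions:

```python
import time
from typing import Callable

def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float,
                     interval_s: float = 1.0,
                     clock=time.monotonic,
                     sleep=time.sleep) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False

# Usage with a fake probe that becomes ready on the third poll
# (a real probe would issue an HTTP request to the endpoint):
calls = {"n": 0}
def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_ready(fake_probe, timeout_s=5, interval_s=0, sleep=lambda s: None))  # → True
```

The same function covers first start (timeout_s=600), restart (timeout_s=60), and autoscale (timeout_s=180) checks by changing only the deadline.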
3. Customer-Facing Technical Support:
- Engage directly with customers via calls, video meetings (Google Meet/Hangouts), and screen-sharing sessions.
- Understand the customer's problem in real time and guide them through the solution.
- Diagnose complex GPU, Slurm, or inference issues and resolve them collaboratively on the call.
- Provide clear updates and ensure timely resolution of support tickets.
- Document root-cause analyses (RCAs) and contribute to permanent fixes or product improvements.
- Communicate professionally and technically with data scientists, developers, and enterprise users.
4. Automation and Reliability:
- Automate cluster provisioning and monitoring using Terraform, Ansible, and Python.
- Create scripts for routine cluster health checks, GPU utilization monitoring, and job queue validation.
- Collaborate with the platform and DevOps teams to implement improvements for speed and reliability.
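The "all workers ready" validation could be scripted along these lines: parse `sinfo --noheader --format=%n,%t` output and report nodes not in a healthy state. The format specifiers and state codes follow standard Slurm conventions; treating everything outside idle/alloc/mix as unhealthy is an assumption for this sketch:

```python
# Illustrative Slurm node-state validation. Expects one "node,state" pair
# per line, as produced by: sinfo --noheader --format=%n,%t
HEALTHY_STATES = {"idle", "alloc", "mix"}

def not_ready_nodes(sinfo_output: str) -> list[str]:
    """Return node names whose Slurm state is outside the healthy set."""
    bad = []
    for line in sinfo_output.strip().splitlines():
        node, state = line.split(",")
        # Slurm state suffixes such as '*' (non-responding) or '~'
        # (powered down) keep those states out of the healthy set.
        if state.strip() not in HEALTHY_STATES:
            bad.append(node.strip())
    return bad

# Sample sinfo output (node names invented for illustration):
sample = """gpu-node-01,alloc
gpu-node-02,idle
gpu-node-03,drain
gpu-node-04,down*"""
print(not_ready_nodes(sample))  # → ['gpu-node-03', 'gpu-node-04']
```

A cron job or CI step could run this after every Slurm deployment and fail loudly if the list is non-empty.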
Key Skills & Qualifications:
- 2–4 years of experience in GPU-based cloud operations, MLOps, or infrastructure engineering.
- Prior exposure to customer-facing roles or live technical troubleshooting calls.
- Experience working with AI model training pipelines, inference endpoints, or Slurm-managed clusters.
- Familiarity with LLM workloads such as Llama, Mistral, or Falcon models.
- Linux (Ubuntu/CentOS), system performance tuning
- Networking: InfiniBand, VLAN, VPN, ALB, DNS, NAT
- Containers & orchestration: Docker, Kubernetes, Helm
- GPU operations: CUDA, GPU drivers, nvidia-smi, MIG configuration
- Distributed training: Slurm, DDP (Distributed Data Parallel) concepts
- AI Inference: vLLM, TensorRT, ONNX Runtime, Hugging Face models
- Infrastructure as Code: Terraform, Ansible
- Tools: ssh, curl, tcpdump, Prometheus, Grafana, ELK