Expoint – all jobs in one place

NVIDIA Director, Capacity Engineering – DGX Cloud
United States, California 
500046822

Today
US, CA, Santa Clara
Time type: Full time
Posted: 4 days ago
job requisition id

What you’ll be doing:

  • Lead end-to-end capacity strategy and forecasting for DGX Cloud across regions and cloud partners (Azure, OCI, GCP, etc.).

  • Define and implement golden-image standards for DGX nodes: firmware, CUDA/NVIDIA drivers, NCCL/InfiniBand, NVLink/NVSwitch fabrics.

  • Invent and operate automated maintenance and upgrade frameworks with near-zero customer impact, including guardrails, rollback plans, and buffer management.

  • Own service-level objectives (SLOs) for GPU availability, efficiency, and training/inference reliability; drive continuous improvement and root-cause analysis.

  • Guide development of orchestration tools and APIs coordinated with NVIDIA tools and DGX Cloud provisioning systems.

  • Partner with DGX Cloud software, data-center engineering, supply chain, and finance to align capacity, cost, and rollout priorities.

  • Recruit, mentor, and lead an elite team of capacity engineers, SREs, and tooling developers.

What we need to see:

  • 12+ years overall in large-scale infrastructure or site-reliability engineering, including 5+ years in senior leadership.

  • Bachelor's or Master's degree in an engineering field, or equivalent experience.

  • Deep understanding of GPU-accelerated compute, including DGX systems, NVLink/NVSwitch fabrics, InfiniBand/Ethernet networking, and high-performance storage.

  • Demonstrated success in capacity planning and fleet consistency across multi-region or multi-cloud environments.

  • Expertise in driver/firmware management (CUDA stack, NCCL, OS/kernel dependencies) and distributed training workloads.

  • Proven track record of delivering against strict availability and performance SLOs at hyperscale.

Ways to stand out from the crowd:

  • Experience with hybrid cloud deployments and hyperscale partnerships.

  • Familiarity with Kubernetes GPU scheduling and AI/ML workload patterns.

  • Track record of influencing hardware/system roadmaps (DGX, Grace Hopper, next-gen GPUs) based on capacity insights.

  • Strong interpersonal skills to align executives, engineers, and partners around ambitious capacity targets.

You will also be eligible for equity and benefits.