Expoint – all jobs in one place

Nvidia – Director, Technical Program Management - AI/ML Platforms
Location: Santa Clara, CA, United States
Time type: Full time
Posted: 10 Days Ago
Job requisition id: 350323486
What You’ll Be Doing:

  • Lead and scale the Technical Program Management organization responsible for the DGX Cloud AI/ML platform, enabling 1,000+ NVIDIA researchers globally.

  • Drive the roadmap for end-to-end AI/ML infrastructure, spanning cluster bring-up, workload orchestration, GPU resource management, and integration with MLOps pipelines.

  • Collaborate with engineering and research leaders to define platform requirements, align compute strategy with AI model development, and deliver a seamless researcher experience.

  • Lead complex programs involving next-generation systems (e.g., GB200) and fleet-wide scaling initiatives across OCI, GCP, and other hyperscalers.

  • Own platform efficiency and capacity management, using deep understanding of scheduling systems (e.g., Slurm, hybrid models) to optimize job placement, utilization, and turnaround time.

  • Establish data-driven operational metrics (availability, occupancy, wait times, throughput) and use them to guide continuous improvement and prioritization.

  • Implement governance and visibility frameworks that drive alignment, predictability, and accountability across AI platform initiatives.

  • Represent DGX Cloud programs to senior leadership, clearly articulating impact, risk, and value across engineering and research organizations.

What We Need to See:

  • 15+ overall years of technical program management experience, including 7+ years leading and developing TPM teams in infrastructure, AI/ML, or platform engineering domains.

  • Demonstrated success delivering large-scale AI/ML systems and platform initiatives, spanning workload orchestration, data pipeline integration, model training environments, and GPU fleet management.

  • Deep technical understanding of AI/ML workflows, job scheduling (Slurm, Kubernetes, hybrid orchestration), and large-scale distributed systems.

  • Proficiency in optimizing resource usage and monitoring performance metrics in compute-heavy settings.

  • Experience building platforms across cloud and on-prem hybrid architectures, integrating with internal and external MLOps stacks.

  • Proficiency with observability and telemetry tools (e.g., Grafana, Prometheus) for infrastructure monitoring and performance analysis.

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience).

Ways to Stand Out from the Crowd:

  • Expertise in AI/ML systems, model lifecycle management, and developer tooling for large-scale training workloads.

  • Track record of driving R&D productivity platforms and reducing friction for machine learning practitioners.

  • Experience in new product introduction (NPI) for research and infrastructure systems.

  • Deep familiarity with cloud compute and orchestration technologies, and a passion for automation and operational excellence.

  • Executive communication skills, able to translate complex technical programs into clear business and research outcomes.

You will also be eligible for equity.