Expoint – all jobs in one place

Nvidia Senior Manager Technical Program Management - DGX Cloud 
United States, California 
49030427

28.07.2025
US, CA, Santa Clara
Time type: Full time
Posted: 2 days ago

What you will be doing:

  • Lead with impact to build and scale a high-performing team of Technical Program Managers focused on delivering a world-class AI platform that empowers more than 1,000 NVIDIA researchers. Ensure the team is customer-obsessed, prioritizing developer productivity, platform usability, and end-to-end user experience.

  • Deep understanding of Slurm: architecture, configuration, workload management, and job prioritization/fair-share.

  • Experience with end-to-end cluster bring-ups and integration with MLOps stacks, including deep familiarity with operational models, fleet efficiency metrics, and deployment across hyperscaler environments such as OCI, GCP, and others.

  • Skilled in capacity modeling, demand forecasting, and supply-demand balancing, with experience using prioritization frameworks and collaborating with governance teams to define and implement prioritization strategies.

  • Lead initiatives to reduce GPU idle waste and improve cluster utilization metrics. Drive developer-centric programs and own the execution of key initiatives that accelerate internal developer velocity.

  • Establish and enforce best-in-class program governance, roadmap planning, and risk management processes. Encourage transparency and accountability throughout engineering and operations by defining clear key metrics and reporting frameworks.

  • Develop and execute a communication strategy that keeps stakeholders at all levels, from engineering contributors to NVIDIA leadership, informed about program progress, blockers, and impact.

What we need to see:

  • 15+ years of overall program management experience leading large-scale software, AI/ML, and infrastructure programs in fast-paced, matrixed environments, including 8+ years managing a team.

  • Hands-on experience driving programs that support AI/ML platform development, including workload orchestration, platform reliability, researcher tooling, GPU resource management, hardware readiness states, and integration with customer MLOps pipelines

  • Proven track record delivering sophisticated AI/ML infrastructure programs at scale—ideally in cloud, hyperscaler, or enterprise datacenter settings—with a deep understanding of system architecture and cluster deployments.

  • Strong grasp of capacity modeling, forecasting techniques, and demand/supply reconciliation in compute environments, using fleet-wide metrics such as availability, utilization, and occupancy, and the ability to use that data to drive operational excellence and roadmap prioritization.

  • Proficiency with tools like Grafana, Prometheus, or scheduler-native tools to monitor job efficiency, wait times, and node health

  • MS degree in Computer Science or a related technical field, or equivalent experience.

Ways to stand out from the crowd:

  • Highly motivated, with strong communication skills and a proven track record of working successfully with multi-functional teams and coordinating effectively across organizational boundaries and geographies.

  • Solid understanding of cloud technologies is a plus.

  • Experience with new product introduction and program managing research teams.

  • Background with productivity tools and process automation is a big plus.

You will also be eligible for equity and .