What you will be doing:
Lead with impact to build and scale a high-performing team of Technical Program Managers focused on delivering a world-class AI platform that empowers more than 1,000 NVIDIA researchers. Ensure the team is customer-obsessed, prioritizing developer productivity, platform usability, and end-to-end user experience.
Deep understanding of Slurm: architecture, configuration, workload management, and job prioritization/fair-share.
Experience with end-to-end cluster bring-ups and integration with MLOps stacks, including deep familiarity with operational models, fleet efficiency metrics, and deployment across hyperscaler environments such as OCI, GCP, and others.
Skilled in capacity modeling, demand forecasting, and supply-demand balancing, with experience using prioritization frameworks and collaborating with governance teams to define and implement prioritization strategies.
Lead initiatives to reduce GPU idle waste and improve cluster utilization metrics. Drive developer-centric programs and own the execution of key initiatives that accelerate internal developer velocity.
Establish and enforce best-in-class program governance, roadmap planning, and risk management processes. Encourage transparency and accountability throughout engineering and operations by defining clear key metrics and reporting frameworks.
Develop and execute a communication strategy that keeps stakeholders at all levels, from engineering contributors to NVIDIA leadership, informed about program progress, blockers, and impact.
What we need to see:
15+ years of overall program management experience leading large-scale software, AI/ML, and infrastructure programs in fast-paced, matrixed environments, including 8+ years managing a team.
Hands-on experience driving programs that support AI/ML platform development, including workload orchestration, platform reliability, researcher tooling, GPU resource management, hardware readiness states, and integration with customer MLOps pipelines
Proven track record delivering sophisticated AI/ML infrastructure programs at scale—ideally in cloud, hyperscaler, or enterprise datacenter settings—with a deep understanding of system architecture and cluster deployments.
Strong grasp of capacity modeling, forecasting techniques, and demand/supply reconciliation in compute environments, using fleet-wide metrics such as availability, utilization, and occupancy, and the ability to use that data to drive operational excellence and roadmap prioritization.
Proficiency with tools like Grafana, Prometheus, or scheduler-native tools to monitor job efficiency, wait times, and node health
MS degree in Computer Science or a related technical field, or equivalent experience.
Ways to stand out from the crowd:
Highly motivated, with strong communication skills and a proven track record of working successfully with multi-functional teams and coordinating effectively across organizational boundaries and geographies.
Solid understanding of cloud technologies is a plus.
Experience with new product introduction and with program management for research teams.
Background with productivity tools and process automation is a big plus.
You will also be eligible for equity and .