Nvidia Senior Manager AI Infrastructure Engineering - DGX Cloud 
United States, Texas 
842203355

08.07.2025
US, CA, Santa Clara
US, Remote
Time type: Full time
Posted on: Posted Yesterday

We are looking for a seasoned leader to provide strategic direction and guidance to the operational and AI tooling team in a large-scale, mission-critical environment. This is a people-focused role where you will drive operational excellence by maturing our incident response, change management, problem management, and issue management tooling. You will foster a blameless culture of continuous improvement, while empowering your team to leverage and contribute to both foundational infrastructure and pioneering AI/ML tools for smarter debugging, automation, knowledge sharing, and post-incident learning.

What you’ll be doing:
  • Lead a team of software and AI engineers responsible for building systems that power incident, change, and problem management across the DGXC Universe.

  • Guide technical strategy and execution across multiple workstreams and teams.

  • Partner with infrastructure, product, and security teams to design resilient systems and enforce consistent operational practices across domains.

  • Own the roadmap and delivery of AI/ML-powered operational tools—improving incident classification, root cause analysis, and mitigation automation.

  • Develop and scale engineering teams to meet growing demands; mentor engineers and technical leads while fostering a high-performing, collaborative culture.

  • Represent the team in cross-organizational forums, drive alignment on priorities, and provide clear technical direction during complex or high-severity incidents.

What we need to see:
  • 12+ years of overall experience in software engineering or related technical roles, including 5+ years in engineering leadership managing multiple teams or technical programs.

  • BS degree in Computer Science or a related technical field, or equivalent experience.

  • Demonstrated success in building and scaling systems that support incident, change, and problem management across complex, high-scale environments.

  • Experience leading teams in the adoption and integration of AI/ML technologies to enhance operational insight, efficiency, and automation.

  • Excellent interpersonal skills, with the ability to synthesize and convey technical and operational issues to executive stakeholders and multi-functional audiences.

  • Strong people leadership skills—you know how to develop engineers, grow new teams, and drive alignment and execution in high-pressure, high-visibility situations.

Ways to stand out from the crowd:
  • You’ve led the design and rollout of internal platforms or operational tooling used company-wide, especially in high-scale or hybrid cloud environments. You have hands-on experience integrating LLMs or other AI/ML technologies into engineering workflows—especially for automation, classification, or summarization use cases.

  • You’ve established or matured incident management programs that improved accountability, reduced MTTR, or enhanced learning from operational events. You’ve driven successful cross-functional efforts involving security, infrastructure, and product teams, and can point to measurable improvements in process or system resilience.

  • You bring a product-minded approach to internal tools—focusing not only on functionality, but also usability, adoption, and long-term maintainability. You have experience owning and scaling a service catalog program at the organizational level to improve service ownership, operational readiness, and engineering accountability across large, complex environments.

You will also be eligible for equity and benefits.