Nvidia Engineering Manager - AI DevOps
United States, Texas
Job ID: 399498192
Date: 15.10.2025

Locations: US, CA, Santa Clara; US, CA, Remote
Time type: Full time
Posted: 19 days ago

What you'll be doing:

  • Supervise a team of DevOps engineers with expertise in AI inference infrastructure, test automation (SDET), and Infrastructure as Code (IaC)

  • Architect and implement scalable test automation strategies for AI inference workloads, including performance benchmarking and automated quality gates (a minimal illustrative sketch follows this list)

  • Lead the maintenance of our GitHub First public CI infrastructure, focusing on single- and multi-GPU testing, Kubernetes multi-node GPU testing, and cloud service provider (CSP) validation

  • Drive Infrastructure as Code efforts using Terraform, Ansible, and Kubernetes to scale and manage GPU clusters across multiple clouds

  • Establish operational excellence, including 24x7 on-call rotations, SRE practices, automated monitoring, and self-healing systems, to maintain uptime above 99.9%

  • Lead release coordination, cost optimization, and management of multi-cloud deployments

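As a purely illustrative example of the kind of automated quality gate described in the responsibilities above, the following is a minimal pytest sketch for single/multi-GPU inference checks. It assumes a CUDA-capable CI runner with PyTorch installed; the latency budget and test names are hypothetical and not taken from this posting. In a GitHub Actions setup, a job like this would typically run on a self-hosted GPU runner and gate merges on the result.

```python
# Hypothetical pytest quality gate for GPU inference CI (illustrative sketch only).
# Assumes a CUDA-capable runner with PyTorch installed; the thresholds are made up.
import time

import pytest
import torch

cuda_required = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="requires at least one CUDA GPU"
)


@cuda_required
def test_single_gpu_inference_latency():
    """Fail the pipeline if a tiny matmul 'inference' step regresses past a budget."""
    device = torch.device("cuda:0")
    x = torch.randn(64, 1024, device=device)
    w = torch.randn(1024, 1024, device=device)

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        _ = x @ w
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000 / 100

    # Hypothetical per-iteration latency budget acting as the quality gate.
    assert elapsed_ms < 5.0, f"latency regression: {elapsed_ms:.2f} ms per step"


@cuda_required
def test_multi_gpu_visibility():
    """Multi-GPU runners should expose every device they were provisioned with."""
    count = torch.cuda.device_count()
    assert count >= 1
    for i in range(count):
        assert torch.cuda.get_device_name(i)  # device is reachable and named
```
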
What we need to see:

  • Bachelor's/Master's degree in Computer Science, Engineering, or equivalent experience

  • 4+ years leading DevOps/SRE organizations with direct SDET leadership experience

  • 8+ years hands-on experience in software development, test automation, or infrastructure engineering with AI/ML or GPU-intensive workloads

  • Proficiency in Infrastructure as Code (IaC) platforms: Terraform, Ansible, or CloudFormation with exposure to multiple cloud environments (AWS, GCP, Azure, OCI)

  • Strong technical leadership in test automation frameworks, CI/CD pipeline development, and quality engineering practices

  • Familiarity with containerization and orchestration tools such as Docker and Kubernetes for managing AI/ML workloads and GPU resources (see the sketch after this list)

  • Proven success building and scaling teams in fast-paced, high-growth environments

  • Effective interpersonal skills to collaborate with remote teams and build consensus

  • Proficiency in Python, Rust, or related programming languages, along with the ability to contribute to architecture discussions

  • Demonstrated operational excellence, including 24x7 on-call management, SRE practices, and robust high-availability infrastructure

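As an illustration of the Kubernetes/GPU orchestration point above, here is a minimal sketch that schedules a one-off GPU validation pod with the official Kubernetes Python client. The namespace, pod name, and container image are assumptions, not details from this posting.

```python
# Hypothetical example: schedule a one-off GPU validation pod with the
# official `kubernetes` Python client. Names, namespace, and image are assumed.
from kubernetes import client, config


def launch_gpu_smoke_test(namespace: str = "ci-validation") -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-smoke-test", labels={"purpose": "ci"}),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="cuda-check",
                    # Any CUDA base image would do here; this tag is an assumption.
                    image="nvidia/cuda:12.4.1-base-ubuntu22.04",
                    command=["nvidia-smi"],
                    # Request a single GPU via the NVIDIA device plugin resource.
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"}
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)


if __name__ == "__main__":
    launch_gpu_smoke_test()
```
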
Ways to stand out from the crowd:

  • Experience with CI/CD (specifically GitHub Actions) and with releasing open-source AI software

  • Deep AI/ML infrastructure expertise with NVIDIA technologies such as CUDA, TensorRT, Dynamo, and Triton Inference Server, including GPU cluster operations and GPU workload performance benchmarking

  • Background in DevOps and system software testing, with prior experience leading teams working on inference engines, model serving platforms, or AI acceleration frameworks

  • Track record with monitoring tools (Prometheus, Grafana), security scanning, static/dynamic analysis tools, and license compliance automation for critical AI inferencing frameworks (see the monitoring sketch below)

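To illustrate the monitoring point above, here is a minimal sketch that reads per-node GPU utilization from Prometheus over its standard HTTP query API. The Prometheus URL and the dcgm-exporter metric name are assumptions, not details from this posting.

```python
# Hypothetical monitoring check against Prometheus's standard HTTP query API.
# The URL and metric name (dcgm-exporter's GPU utilization gauge) are assumptions.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def gpu_utilization_by_instance() -> dict[str, float]:
    """Return average GPU utilization per node, as reported by dcgm-exporter."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "avg by (instance) (DCGM_FI_DEV_GPU_UTIL)"},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"]["instance"]: float(r["value"][1]) for r in results}


if __name__ == "__main__":
    for node, util in gpu_utilization_by_instance().items():
        print(f"{node}: {util:.1f}% GPU utilization")
```
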
You will also be eligible for equity and benefits.