
NVIDIA Staff Site Reliability Engineer
United States, California 
629817357

Yesterday
US, CA, Santa Clara
Time type: Full time
Posted: 9 days ago

What you’ll be doing:

  • Lead the technical strategy and roadmap for large-scale, cross-functional SRE initiatives that improve reliability, scalability, and developer productivity across enterprise systems.

  • Design and build resilient distributed systems that power NVIDIA’s next-generation AI-driven enterprise products and services.

  • Drive automation and observability improvements, using metrics and analytics to enhance performance, reliability, and efficiency.

  • Collaborate across Cloud, Platform, Security, and AI/ML teams to implement modern SRE components that ensure high availability and secure operations.

  • Analyze and troubleshoot complex systems, championing best practices in system design, incident management, and postmortem analysis.

  • Mentor and influence engineers across teams, fostering technical excellence and a culture of reliability engineering.

What we need to see:

  • 10+ years of experience in Site Reliability Engineering, Platform Engineering, or Cloud Architect roles.

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.

  • Strong proficiency in programming languages such as Python, TypeScript, JavaScript, or Go, with a focus on automation and infrastructure-as-code.

  • Experience with infrastructure-as-code tools such as AWS CDK, AWS CloudFormation, Terraform, or Crossplane.

  • Solid understanding of OpenTelemetry or other observability implementations at scale.

  • Deep expertise in systems architecture, networking, Kubernetes, and public cloud services (AWS, Azure, or GCP).

  • Outstanding problem-solving, communication, and teamwork skills, with the ability to influence across technical and interpersonal boundaries.

Ways to stand out from the crowd:

  • Passion for and experience with Public Cloud or large-scale automation systems.

  • Demonstrated ability to drive technical strategy and deliver measurable reliability outcomes in complex environments.

  • A strong sense of ownership, curiosity, and innovation—you thrive in ambiguity and turn challenges into opportunities.

You will also be eligible for equity and .