Nvidia Senior DevOps Service Reliability Operations Engineer - DGX Cloud
United States, Texas
469276733

17.11.2025

US, CA, Santa Clara

US, Remote

What you will be doing:

The team provides its services 24/7 in a follow-the-sun model that spans continents. You will report directly to a manager in the United States.
Some CIS shifts require either a Saturday or Sunday each week. The hours worked may include an early or late start (10 hours per day, 4 days per week) to ensure that the combined US and India teams provide 24/7 coverage.
Every CIS team member will use alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and implement predictive support or diagnostic routines.
Perform systems administration, network administration, and security incident monitoring tasks to drive our actions.
CIS team members will work with developers to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added, you will also update and evolve the runbooks as needed.
Help discover incidents and issues, including initiating the incident management procedure.
Bring in subject matter authorities or service owners as needed to resolve issues. Feedback will help us continually improve our service.
Your interpersonal skills will help keep the team engaged through resolution and ensure our clients believe we value their time and effort. You may also perform other tasks that help us provide extraordinary service levels for our customers.

What we need to see:

You are highly motivated, with strong communication skills and the ability to work successfully with multi-functional teams, principals, and architects, coordinating effectively across organizational boundaries and geographies.
5+ years of experience administering large-scale production systems. 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC).
BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience.
Expert-level knowledge of Linux system administration and automation using Ansible and/or Python.
Strong experience with shell scripting, DNS, DHCP, storage systems, and core networking (iptables, routing, firewalls).
Experience with at least one workload manager (Slurm preferred) or job scheduling system in a production environment.
Strong experience troubleshooting and maintaining large-scale bare-metal infrastructure. Strong cross-team collaboration, documentation, and mentoring skills.
Experience improving processes for automation, reliability, and operational excellence.
Expertise using monitoring tools and problem ticketing systems. Strong problem-solving, analytical, and troubleshooting abilities.

Ways to Stand Out from the Crowd:

Advanced hands-on experience with Kubernetes, Slurm, and large-scale cluster management.
Familiarity with GPU hardware and high-performance computing environments.
Experience with observability and incident management tools (Grafana, OpenTelemetry, PagerDuty, JIRA). Cloud experience (AWS, Azure, GCP) is a plus; strong preference for on-prem expertise.

You will also be eligible for equity and .
