Wanted: Reliability Availability Serviceability Expert at Nvidia in United States, Texas

US, CA, Santa Clara

US, Remote

time type: Full time

posted on: Posted 2 Days Ago

What you will be doing:

The team will provide their services 24/7 with a follow-the-sun environment which will span continents. You will report directly to a manager in the United States.
Some CIS shifts require either a Saturday or Sunday each week. The hours worked may include an early or late start (10 hours per day, 4 days per week) to ensure that the combination of the US and India teams provides 24/7 coverage.
Every CIS team member will use alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and implement predictive support or diagnostic routines.
Perform systems administration, network administration, and security incident monitoring tasks to drive our actions.
CIS team members will work with developers to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added, you will also update and evolve the runbooks as needed.
Help discover incidents and issues, including initiating the incident management procedure.
Bring in subject matter authorities or service owners as needed to resolve issues. Feedback will help us continually improve our service.
Your interpersonal skills will help keep the team engaged through resolution and ensure our clients believe we value their time and effort. May perform other tasks that will help us provide extraordinary service levels for our customers.

What we need to see:

Highly motivated with strong communication skills, you have the ability to work successfully with multi-functional teams, principals, and architects, coordinating effectively across organizational boundaries and geographies.
5+ years of experience administering large-scale production systems. 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC).
BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience.
Expert-level knowledge of Linux system administration and automation using Ansible and/or Python.
Strong experience with shell scripting, DNS, DHCP, storage systems, and core networking (iptables, routing, firewalls).
Experience with at least one workload manager (Slurm preferred) or job scheduling system in a production environment.
Strong experience troubleshooting and maintaining large-scale bare-metal infrastructure. Strong cross-team collaboration, documentation, and mentoring skills.
Experience improving processes for automation, reliability, and operational excellence.
Expertise using monitoring tools and problem ticketing systems. Strong problem-solving, analytical, and troubleshooting abilities.

Ways to Stand Out from the Crowd:

Advanced hands-on experience with Kubernetes, SLURM, and large-scale cluster management.
Familiarity with GPU hardware and high-performance computing environments.
Experience with observability and incident management tools (Grafana, OpenTelemetry, PagerDuty, JIRA). Cloud experience (AWS, Azure, GCP) is a plus; strong preference for on-prem expertise.

You will also be eligible for equity and .

More jobs that may interest you

Nvidia Senior Site Reliability Engineer DGX Cloud United States, California

Nvidia Senior Site Reliability Engineer - DGX Cloud United States, Texas

Nvidia Senior Site Reliability Engineer BCM - DGX Cloud United States, Texas

09.11.2025

Nvidia Senior Site Reliability Engineer BCM - DGX Cloud United States, Texas

US, CA, Santa Clara

US, Remote

time type: Full time

posted on: Posted Yesterday

What you’ll be doing:

Contributing to deployments and daily operations of large scale next-generation GPU platforms
Handling incidents in GPU clusters, bridging the gap between cluster operations and development
Designing and implementing small features in the Base Command Manager product to become intimately familiar with the workings of the product
Validating complex cluster configurations including Slurm and Kubernetes orchestrators for performance, scalability and resilience, ensuring they meet real-world customer scenarios.

What we need to see:

Bachelor's Degree or equivalent experience in Computer Science or related field.
8+ years of experience in site reliability engineering and/or software development roles.
Fluency in Python
In-depth knowledge of Linux and networking

Ways to stand out from the crowd:

Experience with C++, high-performance computing, Kubernetes and/or system administration would be an asset
Previous experience as a system admin running BCM/Bright Cluster Manager/Base Command Manager clusters is a definite plus.
Proficiency with cluster networking including InfiniBand and Spectrum-X

You will also be eligible for equity and .

08.11.2025

Nvidia Site Reliability Engineer HPC LSF United States, Texas

time type: Full time

posted on: Posted 4 Days Ago

What you’ll be doing:

Troubleshoot incoming support requests in a large-scale HPC environment.
Contribute enhancements to existing deployment automation, configuration management, observability, and operational monitoring, and improve day-to-day operations through automation.
Ensure compute servers are running the correct operating system and configuration.
Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
Collaborate with specialist teams to drive issues to closure.
Collaborate with domain experts to improve how our chip development process utilizes our infrastructure.
Directly contribute to the overall quality and improve time to market for our next generation chips.

What we need to see:

Proficient in administering CentOS/RHEL Linux distributions.
Understanding of container technologies like Docker.
Proficiency in Python and UNIX scripting languages such as bash.
Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.
BS in Computer Science or a similar degree (or equivalent experience) with 2+ years of relevant post-degree experience.
Solid understanding of cluster configuration management tools such as Ansible.

Ways to stand out from the crowd:

Understanding of key Linux technologies such as NFS, automounter, LDAP, DNS, and TCP/IP networking in Red Hat Linux distribution flavors.
Familiarity with job scheduler administration (e.g. IBM Spectrum LSF or SLURM) and experience building/operating large-scale compute infrastructure.
Knowledge of the FlexLM license management system.
Proficiency in Perl for maintaining legacy automation scripts.
Familiarity with High-Speed Networking (InfiniBand, RDMA, RoCE etc.) and fast, distributed storage systems (Lustre, GPFS, etc.)

You will also be eligible for equity and .

19.10.2025

Nvidia Senior Site Reliability Engineer IaaS PaaS United States, Texas

time type: Full time

posted on: Posted 5 Days Ago

What you'll be doing:

You will play a crucial role in ensuring the success of the Omniverse on DGX Cloud platform by helping to build our deployment infrastructure processes, creating world-class SRE measurement and creating automation tools to improve efficiency of operations, and maintaining a high standard of perfection in service operability and reliability.

Design, build, and implement scalable cloud-based systems for PaaS/IaaS.
Work closely with other teams on new products or features/improvements of existing products.
Develop, maintain and improve cloud deployment of our software.
Participate in the triage & resolution of complex infra-related issues
Collaborate with developers, QA, and Product teams to establish, refine, and streamline our software release process and software observability to ensure service operability, reliability, and availability.
Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces
Develop, maintain and improve automation tools that can help improve efficiency of SRE operations
Practice balanced incident response and blameless postmortems
Be part of an on-call rotation to support production systems

What we need to see:

BS or MS in Computer Science or equivalent program (or equivalent experience).
8+ years of hands-on software engineering or equivalent experience.
Experience programming with Go & Python, React.
Demonstrate understanding of cloud design in the areas of virtualization and global infrastructure, distributed systems, and security.
Expertise in Kubernetes (K8s) & KubeVirt and building RESTful web services.
Understanding of building agentic AI solutions, preferably with Nvidia open source AI solutions. Demonstrated working experience with SRE principles such as metrics emission for observability, monitoring, and alerting using logs, traces, and metrics.
Hands-on experience working with Docker, containers, and Infrastructure as Code tools such as Terraform, as well as deployment CI/CD.
Knowledge of the concepts of working with CSPs, for example AWS (Fargate, EC2, IAM, ECR, EKS, Route53, etc.) and Azure.

Ways to stand out from the crowd:

Expertise in technologies such as StackStorm, OpenStack, Red Hat OpenShift, and AI DBs like Milvus.
A track record of solving complex problems with elegant solutions.
Demonstrate delivery of complex projects in previous roles.
Showcase ability in developing frontend applications with concepts of SSA and RBAC.

You will also be eligible for equity and .

15.10.2025

Nvidia Senior Service Reliability Operations Administrator United States, Texas

US, CA, Santa Clara

US, Remote

time type: Full time

posted on: Posted 2 Days Ago

What you will be doing:

The team will provide their services 24/7 with a follow-the-sun environment which will span continents. You will report directly to a manager in the United States.
Some CIS shifts require either a Saturday or Sunday each week. The hours worked may include an early or late start (10 hours per day, 4 days per week) to ensure that the combination of the US and India teams provides 24/7 coverage.
Every CIS team member will use alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and implement predictive support or diagnostic routines.
Perform systems administration, network administration, and security incident monitoring tasks to drive our actions.
CIS team members will work with developers to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added, you will also update and evolve the runbooks as needed.
Help discover incidents and issues, including initiating the incident management procedure. Bring in subject matter authorities or service owners as needed to resolve issues. Feedback will help us continually improve our service.
Your interpersonal skills will help keep the team engaged through resolution and ensure our clients believe we value their time and effort.
May perform other tasks that will help us provide extraordinary service levels for our customers.

What we need to see:

5+ years of experience administering open system servers in a Production environment. 3+ years of experience working in demanding Internet, Cloud, or Telecommunications environments in a Systems Administration, DevOps, SRE, or NOC role.
B.S. in relevant disciplines or equivalent experience.
Expertise using monitoring tools and problem ticketing systems.
Strong problem-solving, analytical, and troubleshooting abilities.
Strong server administration experience: shell scripting, automation, DNS, DHCP, storage concepts, basic networking, iptables, etc. RHCE or equivalent level of knowledge.
Experience scripting in Python preferred, but not required. Prior experience running virtual machines under open source or commercial hypervisors. Experience operating services running on public or private clouds.
Knowledge and understanding of application containers and container orchestration systems. Basic understanding of Git.
Experience performing system administration tasks using Ansible. Prior experience analyzing system and network performance using monitoring alerts, data, and graphs.
Demonstrate ability to master and maintain complicated environments.

You will also be eligible for equity and .

14.10.2025

Nvidia Senior Site Reliability Engineer United States, Texas

time type: Full time

posted on: Posted 29 Days Ago

What you'll be doing:

Own the solutions you build, collaborating with cross-functional teams to successfully implement them.
Collaborate with various teams in a fast-paced environment to ensure seamless project completion.
Continuously improve solution provisioning and management through automation.
Identify areas to improve service resiliency using industry-standard practices.
Detect performance issues and recommend solutions to maintain world-class service quality.
Conduct capacity management and planning to meet ongoing operational needs.
Participate in incident reviews, assist in root cause identification, and write RCA reports.
Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment - AWS, GCP, and On-prem.
Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.
Participate in the team's on-call rotation.

What we need to see:

B.S. degree in Computer Science or related technical field (or equivalent experience) with 12+ years in building and supporting critical services.
Proficiency in Kubernetes administration, modern CI/CD techniques and Infrastructure as Code (IaC).
Deep understanding of Linux operating systems and TCP/IP fundamentals.
Expertise with at least one major cloud service provider - AWS, GCP, Azure.
Demonstrated proficiency with end-to-end SRE capabilities and observability.
Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.
5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.
Creative problem solver with excellent debugging skills and great communication and documentation abilities.

Ways to stand out from the crowd:

Linux certification from a well-known vendor - RedHat, Oracle, etc.
Prior experience managing large-scale Kubernetes deployment in production.
Strong skills in modern container networking and storage architecture.
Well-known Cloud Certification(s).
Hands-on experience working with Slurm/LSF environments.

You will also be eligible for equity and .

13.10.2025

Nvidia Senior Site Reliability Engineer United States, Texas

time type: Full time

posted on: Posted 29 Days Ago

What you'll be doing:

Own the solutions you build, collaborating with cross-functional teams to successfully implement them.
Collaborate with various teams in a fast-paced environment to ensure seamless project completion.
Continuously improve solution provisioning and management through automation.
Identify areas to improve service resiliency using industry-standard practices.
Detect performance issues and recommend solutions to maintain world-class service quality.
Conduct capacity management and planning to meet ongoing operational needs.
Participate in incident reviews, assist in root cause identification, and write RCA reports.
Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment - AWS, GCP, and On-prem.
Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.
Participate in the team's on-call rotation.

What we need to see:

B.S. degree in Computer Science or related technical field (or equivalent experience) with 5+ years in building and supporting critical services.
Proficiency in Kubernetes administration, modern CI/CD techniques and Infrastructure as Code (IaC).
Deep understanding of Linux operating systems and TCP/IP fundamentals.
Expertise with at least one major cloud service provider - AWS, GCP, Azure.
Demonstrated proficiency with end-to-end SRE capabilities and observability.
Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.
5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.
Creative problem solver with excellent debugging skills and great communication and documentation abilities.

Ways to stand out from the crowd:

Linux certification from a well-known vendor - RedHat, Oracle, etc.
Prior experience managing large-scale Kubernetes deployment in production.
Strong skills in modern container networking and storage architecture.
Experience designing AI chatbots and agentic automation workflows
Hands-on experience working with Slurm/LSF environments.

You will also be eligible for equity and .

Nvidia Senior DevOps Service Reliability Operations Engineer - DGX Cloud

United States, Texas

469276733

16.11.2025

Description:

US, CA, Santa Clara

US, Remote

time type: Full time

posted on: Posted 2 Days Ago

What you will be doing:

The team will provide their services 24/7 with a follow-the-sun environment which will span continents. You will report directly to a manager in the United States.
Some CIS shifts require either a Saturday or Sunday each week. The hours worked may include an early or late start (10 hours per day, 4 days per week) to ensure that the combination of the US and India teams provides 24/7 coverage.
Every CIS team member will use alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and implement predictive support or diagnostic routines.
Perform systems administration, network administration, and security incident monitoring tasks to drive our actions.
CIS team members will work with developers to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added, you will also update and evolve the runbooks as needed.
Help discover incidents and issues, including initiating the incident management procedure.
Bring in subject matter authorities or service owners as needed to resolve issues. Feedback will help us continually improve our service.
Your interpersonal skills will help keep the team engaged through resolution and ensure our clients believe we value their time and effort. May perform other tasks that will help us provide extraordinary service levels for our customers.

What we need to see:

Highly motivated with strong communication skills, you have the ability to work successfully with multi-functional teams, principals, and architects, coordinating effectively across organizational boundaries and geographies.
5+ years of experience administering large-scale production systems. 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC).
BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience.
Expert-level knowledge of Linux system administration and automation using Ansible and/or Python.
Strong experience with shell scripting, DNS, DHCP, storage systems, and core networking (iptables, routing, firewalls).
Experience with at least one workload manager (Slurm preferred) or job scheduling system in a production environment.
Strong experience troubleshooting and maintaining large-scale bare-metal infrastructure. Strong cross-team collaboration, documentation, and mentoring skills.
Experience improving processes for automation, reliability, and operational excellence.
Expertise using monitoring tools and problem ticketing systems. Strong problem-solving, analytical, and troubleshooting abilities.

Ways to Stand Out from the Crowd:

Advanced hands-on experience with Kubernetes, SLURM, and large-scale cluster management.
Familiarity with GPU hardware and high-performance computing environments.
Experience with observability and incident management tools (Grafana, OpenTelemetry, PagerDuty, JIRA). Cloud experience (AWS, Azure, GCP) is a plus; strong preference for on-prem expertise.

You will also be eligible for equity and .