
Nvidia Senior Site Reliability Engineer BCM - DGX Cloud 
United States, Texas 
672028987

US, CA, Santa Clara
US, Remote
Time type: Full time
Posted on: Yesterday

What you’ll be doing:

  • Contributing to deployments and daily operations of large scale next-generation GPU platforms

  • Handling incidents in GPU clusters, bridging the gap between cluster operations and development

  • Designing and implementing small features in the Base Command Manager product to become intimately familiar with the workings of the product

  • Validating complex cluster configurations, including Slurm and Kubernetes orchestrators, for performance, scalability and resilience, ensuring they hold up in real-world customer scenarios

What we need to see:

  • Bachelor's degree in Computer Science or a related field, or equivalent experience.

  • 8+ years of experience in site reliability engineering and/or software development roles.

  • Fluency in Python.

  • In-depth knowledge of Linux and networking.

Ways to stand out from the crowd:

  • Experience with C++, high-performance computing, Kubernetes and/or system administration.

  • Previous experience as a system administrator running BCM (Bright Cluster Manager / Base Command Manager) clusters.

  • Proficiency with cluster networking, including InfiniBand and Spectrum-X.

You will also be eligible for equity and benefits.