Nvidia Senior Site Reliability Engineer BCM - DGX Cloud
United States, Texas
672028987

10.11.2025

US, CA, Santa Clara

US, Remote

time type: Full time

posted on: Yesterday

What you’ll be doing:

Contributing to deployments and daily operations of large-scale, next-generation GPU platforms
Handling incidents in GPU clusters, bridging the gap between cluster operations and development
Designing and implementing small features in the Base Command Manager product to become intimately familiar with how the product works
Validating complex cluster configurations, including Slurm and Kubernetes orchestrators, for performance, scalability and resilience, ensuring they meet real-world customer scenarios

What we need to see:

Bachelor's degree in Computer Science or a related field, or equivalent experience
8+ years of experience in site reliability engineering and/or software development roles
Fluency in Python
In-depth knowledge of Linux and networking

Ways to stand out from the crowd:

Experience with C++, high-performance computing, Kubernetes and/or system administration would be an asset
Previous experience as a system admin running BCM/Bright Cluster Manager/Base Command Manager clusters is a definite plus
Proficiency with cluster networking, including InfiniBand and Spectrum-X

You will also be eligible for equity.
