Expoint – all jobs in one place

Nvidia Senior Site Reliability Engineer BCM - DGX Cloud
United States, Texas
672028987

10.11.2025

US, CA, Santa Clara

US, Remote

time type: Full time

posted on: Yesterday

job requisition id

What you’ll be doing:

Contributing to deployments and daily operations of large scale next-generation GPU platforms
Handling incidents in GPU clusters, bridging the gap between cluster operations and development
Designing and implementing small features in the Base Command Manager product to become intimately familiar with the workings of the product
Validating complex cluster configurations including Slurm and Kubernetes orchestrators for performance, scalability and resilience, ensuring they meet real-world customer scenarios.

What we need to see:

Bachelor's Degree or equivalent experience in Computer Science or related field.
8+ years of experience in site reliability engineering and/or software development roles.
Fluency in Python
In-depth knowledge of Linux and networking

Ways to stand out from the crowd:

Experience with C++, high-performance computing, Kubernetes and/or system administration would be an asset
Previous experience as a system admin running BCM/Bright Cluster Manager/Base Command Manager clusters is a definite plus.
Proficiency with cluster networking including InfiniBand and Spectrum-X

You will also be eligible for equity.

More jobs that may interest you

Nvidia Senior Site Reliability Engineer DGX Cloud United States, California

Nvidia Senior Site Reliability Engineer - DGX Cloud United States, Texas

Nvidia Senior Site Reliability Engineer - DGX Cloud United States, Texas

Nvidia Senior Site Reliability Engineer DGX Cloud India, Uttarakhand, Dehradun
