

What you’ll be doing:
Contributing to deployments and daily operations of large-scale, next-generation GPU platforms
Handling incidents in GPU clusters, bridging the gap between cluster operations and development
Designing and implementing small features in the Base Command Manager product to become intimately familiar with the workings of the product
Validating complex cluster configurations, including Slurm and Kubernetes orchestrators, for performance, scalability and resilience, ensuring they hold up in real-world customer scenarios
What we need to see:
Bachelor's Degree or equivalent experience in Computer Science or related field.
8+ years of experience in site reliability engineering and/or software development roles.
Fluency in Python
In-depth knowledge of Linux and networking
Ways to stand out from the crowd:
Experience with C++, high-performance computing, Kubernetes and/or system administration would be an asset
Previous experience as a system admin running BCM/Bright Cluster Manager/Base Command Manager clusters is a definite plus.
Proficiency with cluster networking including InfiniBand and Spectrum-X
You will also be eligible for equity.