Architect, develop, and maintain Python-based tools and services to efficiently run a performance-focused, multi-tenant Linux cluster spanning embedded, desktop, and server systems
Work with industry-standard tools (Kubernetes, Slurm, Ansible, GitLab, Artifactory, Jira)
Actively support users doing development, functional testing, and performance testing on current and pre-production GPU cluster systems
Work with teams across NVIDIA in different time zones to influence and adopt the latest tools for operating GPU clusters
Collaborate with users and system administrators to identify ways to improve user experience and operational efficiency
Become an expert on the entire AI infrastructure stack
BS or higher degree in computer science with 4+ years of relevant experience
Adept programming skills in multiple languages, including Python
In-depth experience with distributed systems and cluster management stacks (logging, monitoring, scheduling, etc.)
Hands-on experience with continuous integration and deployment tools (e.g., GitLab CI)
Outstanding ability to understand users, prioritize among many contending requests, and build consensus
Passion for “it just works” automation, eliminating repetitive tasks, and enabling team members
Deep understanding of Linux system administration and container technologies
Proficient English communication skills
Experience automating operations for bare-metal clusters
Experience with GPU computing systems
Track record of identifying useful new technologies or methods and incorporating them into software development flows
Experience as an active contributor to a software project involving many developers, or as a maintainer of open-source software