Expoint – all jobs in one place

NVIDIA Base Command Manager Engineer - NVIS NPI
United States, California 
585690109

Today
US, CA, Santa Clara
Time type: Full time
Posted on: 3 Days Ago
job requisition id

What you'll be doing

  • Play a key role in NVIDIA’s NPI team, acting as the link between engineering and the NVIS field team for cluster deployment and management solutions.

  • Collaborate closely with engineering and product teams to review and influence design decisions for products centered around large-scale, BCM-managed clusters.

  • Evaluate changes in BCM and underlying OS/software stacks, communicating the impact to the field organization to maintain robust and scalable deployment workflows.

  • Define and relay detailed cluster management requirements to engineering, enabling the successful New Product Introduction (NPI) of next-generation GPU platforms.

  • Translate architectural and design changes into clear, actionable tasks for the field, including standardized deployment guides, configuration best practices, and validation workflows.

  • Validate complex cluster configurations including Slurm and Kubernetes orchestrators for performance, scalability, and resilience, ensuring they meet real-world customer scenarios.

  • Support the NPI team by bridging knowledge gaps, tracking progress, and aligning collaborators throughout the product development lifecycle.

  • Support NVIDIA's mission by ensuring our breakthrough technologies are successfully deployed for global customers and OEM partners.

What we need to see

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).

  • 10+ years of experience in at least two of the following: HPC/large-scale cluster administration, Linux systems engineering, infrastructure automation (e.g., Ansible, Salt), or data center operations.

  • 5+ years of direct, hands-on experience provisioning, managing, and fixing clusters using NVIDIA Base Command Manager (BCM).

  • Deep, practical knowledge of how Slurm and Kubernetes are orchestrated, deployed, and managed by BCM, including workload submission and resource management.

  • Proficiency in Python and Bash scripting for automation, cluster validation, and workflow optimization.

  • In-depth experience with cluster management and monitoring tools (e.g., Prometheus, Grafana, DCGM, and similar observability stacks).

  • Outstanding written and verbal communication skills, with the ability to explain complex technical concepts to both technical and non-technical collaborators.

  • A customer-first attitude, self-motivation, and a proactive approach to leadership in diverse environments.

Ways to stand out from the crowd:

  • Proficiency with cluster networking including InfiniBand and Spectrum-X.

  • Experience with NVIDIA Mission Control.

  • Familiarity with CI/CD workflows in an infrastructure context, including tools such as Git, GitLab, and Jenkins.

  • Background in Professional Services, customer-facing deployment, and solutions optimization.

  • Industry certifications such as CKA/CKAD (Certified Kubernetes Administrator/Developer), RHCE, or other advanced Linux/HPC credentials.

You will also be eligible for equity and benefits.