NVIDIA Senior AI Infrastructure Engineer
United States, California
364465871

01.12.2024

What you'll be doing:

  • Administer an internal NVIDIA AI cluster composed of Linux systems ranging from the world’s most powerful servers to embedded systems

  • Maintain the configuration of our resource management system (Slurm) to keep resource allocation efficient and aligned with organizational priorities

  • Automate configuration management, software updates, and maintenance of system availability using modern DevOps tools (Ansible, GitLab, etc.)

  • Plan and maintain new systems that support the NVIDIA Software stack

  • Work directly with developers and hardware architects to debug issues, identify new requirements, and improve workflows

  • Actively communicate with users and management regarding resource planning and allocation

What we need to see:

  • 5+ years of previous experience deploying and administering large scale clusters, tuned for development efforts in AI

  • MS in Computer Science, Computer Engineering, or EECE, or a BS (or equivalent experience)

  • Deep knowledge of distributed resource scheduling systems (Slurm preferred, LSF, etc.)

  • Demonstrated ability to script in bash, and at least one high-level language (Python preferred)

  • Experience with container technologies (Docker, Singularity, etc.)

  • Deep understanding of operating systems, computer networks, and high-performance hardware

  • Ability to work well with developers, hardware architects, and test engineers

  • Passionate dedication to providing quality support for users

Ways to stand out from the crowd:

  • Prior work experience managing high performance fabrics and parallel file systems

  • Familiarity with CUDA and managing GPU-accelerated computing systems

  • Basic knowledge of deep learning frameworks and algorithms

You will also be eligible for equity and benefits.