Expoint – all jobs in one place

Nvidia Senior ML Storage Engineer - GPU Clusters 
United States, Texas 
34535020

Today
US, CA, Santa Clara
US, WA, Remote
US, WA, Redmond
US, WA, Seattle
Time type: Full time
Posted: 10 days ago

What you will be doing:

  • Research and implement distributed storage services.

  • Design and implement scalable and efficient storage solutions tailored for data-intensive AI applications, optimizing performance and cost-effectiveness.

  • Continuously improve storage infrastructure provisioning, management, observability, and day-to-day operations through automation.

  • Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.

  • Support globally distributed on-premises and cloud environments such as AWS, GCP, Azure, or OCI.

  • Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.

  • Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences.

  • Support our researchers in running their workflows on our clusters, including performance analysis and optimization of deep learning workflows, and participate in the team's on-call rotation to support critical infrastructure.

  • Drive the evaluation and integration of storage solutions with new GPU platforms, such as GB200, and with cloud technologies to improve system performance.

What we need to see:

  • Minimum BS degree in Computer Science (or equivalent experience), with 6+ years managing high-speed storage solutions deployed for GPU clusters or similar high-performance computing environments.

  • Expertise in designing, deploying, and running production-level cloud services.

  • Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must, including experience analyzing and tuning performance for a variety of AI/HPC workloads.

  • Experience with the architecture, design, and operation of storage solutions on any of the leading cloud environments (AWS, Azure, or GCP).

  • Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.

  • Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby).

  • Proficiency in modern CI/CD techniques and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.

  • Diligent with strong communication and documentation skills.

Ways to stand out from the crowd:

  • Experience running large-scale Slurm/LSF and/or BCM deployments in production environments.

  • Expertise in modern container networking and storage architecture.

  • Experience with machine learning and deep learning concepts, algorithms, and models.

  • Consistent record of defining and driving operational excellence in highly distributed, high-performance environments.

You will also be eligible for equity and .