Nvidia Senior ML Storage Engineer - GPU Clusters
United States, Texas
34535020

12.08.2025

US, CA, Santa Clara

US, WA, Remote

US, WA, Redmond

US, WA, Seattle

What you will be doing:

Research and implement distributed storage services.
Design and implement scalable and efficient storage solutions tailored for data-intensive AI applications, optimizing performance and cost-effectiveness.
Continuously improve storage infrastructure provisioning, management, observability, and day-to-day operations through automation.
Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.
Support globally distributed on-premises and cloud environments such as AWS, GCP, Azure, or OCI.
Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.
Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences.
Support our researchers in running their workflows on our clusters, including performance analysis and optimization of deep learning workflows, and participate in the team's on-call rotation to support critical infrastructure.
Drive the evaluation and integration of storage solutions with new GPU (such as GB200) and cloud technologies to improve system performance.

What we need to see:

Minimum BS degree in Computer Science (or equivalent experience), with 6+ years managing high-speed storage solutions deployed for GPU clusters or similar high-performance computing environments.
Expertise in designing, deploying, and running production-level cloud services.
Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must, including experience analyzing and tuning performance for a variety of AI/HPC workloads.
Experience in the architecture, design, and operation of storage solutions on any of the leading cloud environments (AWS, Azure, or GCP).
Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.
Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby).
Proficient in modern CI/CD techniques, and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.
Diligent with strong communication and documentation skills.

Ways to stand out from the crowd:

Experience running large-scale Slurm/LSF and/or BCM deployments in production environments.
Expertise in modern container networking and storage architecture.
Experience with Machine Learning and Deep Learning concepts, algorithms, and models.
Consistent record of defining and driving operational excellence in highly distributed, high-performance environments.

You will also be eligible for equity.
