Nvidia Distinguished Engineer AI Resiliency Lead
United States, Texas
271372075

01.12.2024

What You'll Be Doing:

Define a scalable software architecture to enable single-job resilient training on hundreds of thousands of GPUs with minimal downtime.
Design and deliver modular, resilient software features to support large-scale AI training for our top customers.
Innovate and evolve resilient architecture designs to meet stringent uptime requirements (downtime < 1%) through solutions such as in-memory checkpointing, in-process restart, and detection of anomalies and silent data corruption (SDC).
Collaborate closely with internal partners, spearheading successful project execution and communicating regular progress updates to senior leadership.

What We Need to See:

A Master’s or Ph.D. in Computer Science, Electrical Engineering, or Computer Engineering from a top-tier university, or equivalent experience.
15+ years of experience in software architecture or related fields, with a deep understanding of AI-optimized systems.
Proven ability to collaborate and communicate effectively across multiple engineering teams.
At least 5 years of hands-on experience in software development on high-complexity projects involving HPC or AI.

Ways to Stand Out from the Crowd:

Proven experience with large-scale AI supercomputing applications, particularly in the training phase.
5+ years of experience using and contributing to modern AI frameworks such as PyTorch and JAX/XLA, specifically for large-scale training workloads.
A strong passion for designing system architectures tailored for AI, covering CPU, GPU, memory, storage, and networking.
Hands-on involvement in the entire lifecycle—from design to deployment—of large-scale High-Performance Computing (HPC) systems.
Experience in implementing HPC software development best practices in large-scale systems.

You will also be eligible for equity.
