Nvidia Senior Software Engineer AI Resiliency
United States, Texas
776303694

14.04.2025

US, CA, Santa Clara

US, WA, Redmond

What You’ll Be Doing:

Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at massive scale, such as fast checkpoint recovery, error detection, error isolation, and straggler/hang detection.
Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.
Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.
Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.
Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.
Support Production Deployments: Assist in debugging and performance tuning of large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads.

What We Need to See:

You've achieved a Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
Proficiency in C++ and Python, with experience in writing efficient, high-performance code.
6+ years of relevant experience.
Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.
Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar.
Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight).
Excellent problem-solving skills.

Ways to Stand Out From the Crowd:

Experience training models or working with model training teams.
Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale.
Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training.
Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads.
Strong systems programming skills and experience with low-level performance tuning.

You will also be eligible for equity and benefits.