NVIDIA Senior Software Engineer, AI Resiliency
United States, Texas 
776303694

14.04.2025
US, CA, Santa Clara
US, WA, Redmond
Time type: Full time
Posted: 28 days ago

What You’ll Be Doing:

  • Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at massive scale, such as fast checkpoint recovery, error detection, error isolation, and straggler/hang detection.

  • Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.

  • Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.

  • Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.

  • Testing & Automation: Develop and implement tests to ensure the robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.

  • Support Production Deployments: Assist in debugging and performance tuning of large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads.


What We Need to See:

  • You've achieved a Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.

  • Proficiency in C++ and Python, with experience in writing efficient, high-performance code.

  • 6+ years of relevant experience.

  • Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.

  • Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar.

  • Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight).

  • Excellent problem-solving skills.

Ways to Stand Out From the Crowd:

  • Experience training models or working with model training teams.

  • Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale.

  • Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training.

  • Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads.

  • Strong systems programming skills and experience with low-level performance tuning.

You will also be eligible for equity and benefits.