Nvidia Senior Software Engineer AI Resiliency
United States, Texas
776303694

14.04.2025

US, CA, Santa Clara

US, WA, Redmond

What You’ll Be Doing:

Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.
Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.
Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.
Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.
Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.
Support Production Deployments: Assist in debugging and performance tuning large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads.

What We Need to See:

You've achieved a Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
Proficiency in C++ and Python, with experience in writing efficient, high-performance code.
6+ years of relevant experience.
Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.
Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar.
Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight).
Excellent problem-solving skills.

Ways to Stand Out From the Crowd:

Training models or working with model training teams.
Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale.
Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training.
Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads.
Strong systems programming skills and experience with low-level performance tuning.

You will also be eligible for equity and .
