What You’ll Be Doing:
Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at massive scale, such as fast checkpoint recovery, error detection, error isolation, and straggler/hang detection.
Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs.
Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation.
Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA.
Testing & Automation: Develop and implement tests to ensure the robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads.
Support Production Deployments: Assist in debugging and performance tuning of large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference.
What We Need to See:
A Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
Proficiency in C++ and Python, with experience in writing efficient, high-performance code.
6+ years of relevant experience.
Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments.
Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar.
Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight).
Excellent problem-solving skills.
Ways to Stand Out From the Crowd:
Training models or working with model training teams.
Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale.
Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training.
Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads.
Strong systems programming skills and experience with low-level performance tuning.
You will also be eligible for equity and benefits.