What You'll Be Doing:
Define a scalable software architecture to enable single-job resilient training on hundreds of thousands of GPUs with minimal downtime.
Design and deliver modular, resilient software features to support large-scale AI training for our top customers.
Innovate and evolve resilient architecture designs to meet stringent uptime requirements (downtime < 1%) through solutions such as in-memory checkpointing, in-process restart, and anomaly and silent data corruption (SDC) detection.
Collaborate closely with internal partners, spearheading successful project execution and communicating regular progress updates to senior leadership.
Qualifications:
A Master’s or Ph.D. in Computer Science, Electrical or Computer Engineering from a top-tier university, or equivalent experience.
15+ years of experience in software architecture or related fields, with a deep understanding of AI-optimized systems.
Excellent and proven ability to collaborate and communicate effectively across multiple engineering teams.
At least 5 years of hands-on experience in software development on high-complexity projects involving HPC or AI.
Proven experience with large-scale AI supercomputing applications, particularly in the training phase.
5+ years of experience using and contributing to modern AI frameworks such as PyTorch and JAX/XLA, specifically for large-scale training workloads.
A strong passion for designing system architectures tailored for AI, covering CPU, GPU, memory, storage, and networking.
Hands-on involvement in the entire lifecycle—from design to deployment—of large-scale High-Performance Computing (HPC) systems.
Experience in implementing HPC software development best practices in large-scale systems.
You will also be eligible for equity and .