What you’ll be doing:
Develop mechanisms to launch and manage large compute jobs to support multi-modal foundation models for robotics. These will include data jobs, training jobs, evaluation jobs, and so forth.
Optimize GPU and cluster utilization for efficient model training, fine-tuning, and evaluation on massive datasets.
Develop robust observability tools and procedures for this compute infrastructure to ensure reliability and performance.
Collaborate with researchers to integrate innovative compute technologies into scalable training and eval pipelines.
What we need to see:
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience
5+ years of full-time industry experience in large-scale MLOps and AI infrastructure
Experience with ML frameworks like PyTorch, JAX, or TensorFlow.
Deep understanding of Kubernetes and experience with Ray
Experience with data frameworks and standards such as SQL, Apache Spark, and LanceDB
Experience with GPU acceleration and CUDA programming
Strong programming skills in Python and a high-performance language such as C++ for efficient system development.
Ways to stand out from the crowd:
Master’s or PhD degree in Computer Science, Robotics, Engineering, or a related field
Demonstrated Tech Lead experience, coordinating a team of engineers and driving projects from conception to deployment
Deep background in building and operating large-scale data infrastructure
Strong experience with, and curiosity about, frontier AI research
You will also be eligible for equity and .