In this role, you will work closely with research teams to design, build, and maintain systems for training and evaluating state-of-the-art agent models.
Key job responsibilities
* Develop cutting-edge training infrastructure so that large-scale reinforcement learning on LLMs runs efficiently and robustly.
* Work across the entire technology stack, including low-level ML systems, job orchestration, and data management.
* Analyze, troubleshoot, and profile complex ML systems to identify and address performance bottlenecks.
* Work closely with researchers and conduct MLSys research to create new techniques, infrastructure, and tooling around emerging research capabilities.
- PhD, or Master's degree and 3+ years of applied research experience
- Experience programming in Java, C++, Python, or a related language
- Experience with deep learning methods and machine learning
- Experience debugging ML systems
- PhD in Computer Science, Machine Learning, or a related field, with a focus on ML systems.
- Demonstrated experience in developing, implementing, and debugging large-scale ML systems.
- Experience with distributed systems, Megatron, vLLM, Ray, and GPUs.
- Experience with patents or publications at top-tier peer-reviewed conferences or journals.

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.