Expoint - all jobs in one place

Finding a high-tech job at the best companies has never been easier


Amazon Member of Technical Staff, Reinforcement Learning Infrastructure, AGI Autonomy
United States, California, San Francisco 
10548934

04.03.2025
DESCRIPTION

In this role, you will work closely with research teams to design, build, and maintain systems for training and evaluating state-of-the-art agent models.

KEY JOB RESPONSIBILITIES

* Develop cutting-edge training infrastructure so that large-scale reinforcement learning on LLMs runs efficiently and robustly.
* Work across the entire technology stack, including low-level ML systems, job orchestration, and data management.
* Analyze, troubleshoot, and profile complex ML systems; identify and address performance bottlenecks.
* Work closely with researchers and conduct MLSys research to create new techniques, infrastructure, and tooling around emerging research capabilities.

BASIC QUALIFICATIONS

- PhD, or Master's degree and 3+ years of applied research experience
- Experience programming in Java, C++, Python or related language
- Experience with deep learning methods and machine learning
- Experience debugging ML systems


PREFERRED QUALIFICATIONS

- PhD in Computer Science, Machine Learning, or a related field, with a focus on ML systems.
- Demonstrated experience in developing, implementing, and debugging large-scale ML systems.
- Experience with distributed systems, Megatron, vLLM, Ray, and working with GPUs.
- Experience with patents or publications at top-tier peer-reviewed conferences or journals.

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.