Expoint - all jobs in one place

The place where the best experts and companies meet


Amazon PhD ML Infrastructure Engineer - Distributed 
United States, California, Cupertino 
292200444

Today
DESCRIPTION

Key job responsibilities
You'll help develop and improve distributed training capabilities in popular machine learning frameworks (PyTorch and JAX) using AWS's specialized AI hardware. Working with our compiler and runtime teams, you'll learn how to optimize ML models to run efficiently on AWS's custom AI chips (Trainium and Inferentia). This is a great opportunity to bridge the gap between ML frameworks and hardware acceleration, while building strong foundations in distributed systems.

We're looking for someone with solid programming skills, enthusiasm for learning complex systems, and a basic understanding of machine learning concepts. This role offers excellent growth opportunities in the rapidly evolving field of ML infrastructure.

BASIC QUALIFICATIONS


- To qualify, applicants must have earned, or expect to earn, a PhD between December 2023 and September 2025.
- Working knowledge of C++ and Python
- Experience with ML frameworks, particularly PyTorch and/or JAX
- Understanding of parallel computing concepts and CUDA programming


PREFERRED QUALIFICATIONS

- Open-source contributions to, or research publications on, ML frameworks or tools, compilers, or distributed computing
- Experience optimizing ML workloads for performance
- Direct experience with PyTorch internals or CUDA optimization
- Hands-on experience with LLM infrastructure tools (e.g., vLLM, TensorRT)