Expoint - all jobs in one place

Amazon ML Infrastructure Engineer - Distributed Training 
United States, California, Cupertino 
565207624

27.04.2025
DESCRIPTION

Key job responsibilities
You'll help develop and improve distributed training capabilities in popular machine learning frameworks (PyTorch and JAX) using AWS's specialized AI hardware. Working with our compiler and runtime teams, you'll learn how to optimize ML models to run efficiently on AWS's custom AI chips (Trainium and Inferentia). This is a great opportunity to bridge the gap between ML frameworks and hardware acceleration while building strong foundations in distributed systems.

We're looking for someone with solid programming skills, enthusiasm for learning complex systems, and a basic understanding of machine learning concepts. This role offers excellent growth opportunities in the rapidly evolving field of ML infrastructure.

BASIC QUALIFICATIONS


- To qualify, applicants must have earned (or be due to earn) a Bachelor's or Master's degree between December 2022 and September 2025.
- Working knowledge of C++ and Python
- Experience with ML frameworks, particularly PyTorch and/or JAX
- Understanding of parallel computing concepts and CUDA programming


PREFERRED QUALIFICATIONS

- Open source contributions to ML frameworks or tools
- Experience optimizing ML workloads for performance
- Direct experience with PyTorch internals or CUDA optimization
- Hands-on experience with LLM infrastructure tools (e.g., vLLM, TensorRT)