Key job responsibilities
You'll join one of our core ML teams - Frameworks, Distributed Training, or Inference - to enhance machine learning capabilities on AWS's specialized AI hardware. Your responsibilities will include improving PyTorch and JAX for distributed training on Trainium chips, optimizing ML models for efficient inference on Inferentia processors, and collaborating with compiler and runtime teams to maximize hardware performance. You'll also develop and integrate new features in ML frameworks to support AWS AI services. We seek candidates with strong programming skills, eagerness to learn complex systems, and basic ML knowledge. This role offers growth opportunities in ML infrastructure, bridging the gap between frameworks, distributed systems, and hardware acceleration.
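For a flavor of the inference side of this work, here is a minimal, purely illustrative sketch of compiling a PyTorch model for Inferentia with the AWS Neuron SDK's torch_neuronx package; the specific model and output file name are hypothetical choices, not part of the role description:

```python
import torch
import torch_neuronx  # PyTorch integration from the AWS Neuron SDK
from torchvision.models import resnet50

# Load a pretrained model and put it in inference mode.
model = resnet50(weights="DEFAULT").eval()
example = torch.rand(1, 3, 224, 224)  # example input used for tracing

# Ahead-of-time compile the model for Inferentia via the Neuron compiler.
neuron_model = torch_neuronx.trace(model, example)

# The traced artifact behaves like a TorchScript module and can be saved/loaded.
torch.jit.save(neuron_model, "resnet50_neuron.pt")
```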
Basic qualifications
- A Bachelor's or Master's degree earned, or expected to be earned, between December 2022 and September 2025
- Working knowledge of C++ and Python
- Experience with ML frameworks, particularly PyTorch, JAX, and/or vLLM
- Understanding of parallel computing concepts and CUDA programming (see the training-step sketch after this list)
Preferred qualifications
- Open source contributions to ML frameworks or tools
- Experience optimizing ML workloads for performance
- Direct experience with PyTorch internals or CUDA optimization
- Hands-on experience with LLM infrastructure tools (e.g., vLLM, TensorRT)
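As a companion to the parallel-computing item above, here is a minimal sketch of a data-parallel training step on Trainium-style XLA devices via torch_xla, which the Neuron SDK builds on; the model, shapes, and hyperparameters are hypothetical:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

# Trainium cores are exposed to PyTorch as XLA devices through torch_xla.
device = xm.xla_device()

model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(10):
    x = torch.rand(32, 1024, device=device)  # stand-in training batch
    y = torch.rand(32, 1024, device=device)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # All-reduce gradients across replicas, apply the update, and
    # (with barrier=True) flush the lazily built XLA graph for execution.
    xm.optimizer_step(optimizer, barrier=True)
```

Work on the Distributed Training team would of course go well beyond this, into collectives, sharding, and compiler/runtime integration.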