Key job responsibilities
You'll help develop and improve distributed training capabilities in popular machine learning frameworks (PyTorch and JAX) using AWS's specialized AI hardware. Working with our compiler and runtime teams, you'll learn how to optimize ML models to run efficiently on AWS's custom AI chips (Trainium and Inferentia). This is a great opportunity to bridge the gap between ML frameworks and hardware acceleration while building strong foundations in distributed systems.

We're looking for someone with solid programming skills, enthusiasm for learning complex systems, and a basic understanding of machine learning concepts. This role offers excellent growth opportunities in the rapidly evolving field of ML infrastructure.
- To qualify, applicants should have earned (or will earn) a PhD between December 2023 and September 2025.
- Working knowledge of C++ and Python
- Experience with ML frameworks, particularly PyTorch and/or JAX
- Understanding of parallel computing concepts and CUDA programming
- Open-source contributions or research publications related to ML frameworks and tools, compilers, or distributed computing
- Experience optimizing ML workloads for performance
- Direct experience with PyTorch internals or CUDA optimization
- Hands-on experience with LLM infrastructure tools (e.g., vLLM, TensorRT)