We seek an experienced engineer to work on distributed Artificial Intelligence/Machine Learning (AI/ML) systems. This role focuses on developing high-performance collective operations - the fundamental operations that enable AI to scale efficiently across multiple accelerators and servers. Most of our stack uses C/C++ at a relatively low level, requiring knowledge of Linux systems and performance-optimized code.

We value experience with ML frameworks, performance tuning and optimization techniques, embedded systems, and high-speed networking interconnects. Experience optimizing ML workloads is particularly valuable for this role.

Key job responsibilities
You'll work on the stack from ML collective frameworks down to the libfabric and Elastic Fabric Adapter (EFA) stacks. Your focus will be designing and implementing Application Programming Interfaces (APIs) and features, as well as optimizing performance at every layer: reducing latency and maximizing throughput for ML workloads on AWS.

A day in the life
We are a mixed-discipline organization: you'd work side by side with infrastructure experts, hardware engineers, RTL engineers, scientists, and architects. Our workforce spans the globe and is truly international, so you'll collaborate with colleagues from numerous countries. We take mentorship seriously: you can expect senior mentorship, and you will be expected to mentor new and junior engineers.
Basic qualifications
- 5+ years of non-internship professional software development experience
- 5+ years of programming experience with at least one software programming language
- 5+ years of experience leading the design or architecture (design patterns, reliability, and scaling) of new and existing systems
- 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- Experience as a mentor, tech lead, or leader of an engineering team
- Master's degree in computer science or equivalent