Amazon Sr Software Engineer EC2 Instance Networking 
United States, California, Sunnyvale 
767305298

16.06.2025
DESCRIPTION

Our team is responsible for creating the networking software that connects massive AI accelerator clusters, focusing on SmartNIC integration, collective communication optimization, and ultra-high-bandwidth inter-rack connectivity. As a senior engineer, you'll drive technical architecture decisions and lead the development of next-generation distributed AI training infrastructure.
Key job responsibilities
Your responsibilities will include:
* Lead the design and development of high-performance networking software solutions utilizing RDMA and RoCE technologies for large-scale AI clusters
* Architect SmartNIC integration strategies with EC2 control plane systems and define API specifications
* Drive optimization of collective communication patterns and multi-rack networking protocols for distributed AI training
* Lead development of comprehensive performance monitoring, metrics collection, and benchmarking infrastructure
* Design automated testing frameworks and stress testing methodologies for large-scale distributed systems
* Lead complex system-level debugging efforts across hardware acceleration, kernel networking, and distributed applications
* Define technical architecture and strategy for next-generation scale-out AI cluster networking
* Provide technical leadership and mentoring to engineering teams
* Drive cross-functional collaboration with hardware, cloud infrastructure, and AI platform teams
* Lead technical design reviews and establish engineering best practices

About the team
Utility Computing (UC)
Diverse Experiences
AWS values diverse experiences. Even if you do not meet all of the qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying.
About AWS
Inclusive Team Culture
Work/Life Balance
Mentorship & Career Growth
We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional.

BASIC QUALIFICATIONS

- 5+ years of non-internship professional software development experience
- 5+ years of programming experience with at least one programming language
- 5+ years of experience leading design or architecture (design patterns, reliability, and scaling) of new and existing systems
- Experience as a mentor, tech lead or leading an engineering team
- 5+ years of programming experience in C/C++ with focus on high-performance distributed systems
- 5+ years of leading design or architecture of large-scale networked systems
- Deep expertise in RDMA technologies, RoCE implementations, and high-performance networking
- Extensive experience with collective communication libraries (NCCL, RCCL, OneCCL, MPI)
- Experience as a technical lead or leading engineering teams on complex infrastructure projects
PREFERRED QUALIFICATIONS

- Expert-level experience with SmartNIC programming and network acceleration hardware APIs
- Deep knowledge of AI training infrastructure, cluster networking, and scale-out communication patterns
- Proven track record of performance optimization and system-level debugging in distributed environments
- Experience with cloud infrastructure integration, virtualization, and large-scale system deployment
- Understanding of modern AI accelerator architectures and multi-rack cluster design
- Experience building and optimizing systems for trillion-parameter model training workloads
- Track record of delivering complex technical projects in high-performance computing environments
- Strong communication and technical leadership skills
- Master's degree in Computer Science, Computer Engineering, or related field
- Experience with AWS cloud infrastructure and large-scale distributed system operations