Our team is responsible for creating the networking software that connects massive AI accelerator clusters, focusing on SmartNIC integration, collective communication optimization, and ultra-high-bandwidth inter-rack connectivity. You'll be working at the intersection of cloud infrastructure and state-of-the-art AI hardware to solve some of the most challenging networking problems in distributed computing.
Key job responsibilities
Your responsibilities will include:
* Design and develop high-performance networking software solutions utilizing RDMA and RoCE technologies for large-scale AI clusters
* Integrate SmartNIC acceleration hardware with EC2 control plane systems and APIs
* Implement and optimize collective communication patterns for distributed AI training workloads
* Develop comprehensive performance monitoring, metrics collection, and benchmarking tools for high-bandwidth cluster interconnects
* Create automated testing frameworks and stress testing tools for multi-rack distributed systems
* Debug complex system-level issues across hardware acceleration, kernel networking, and distributed applications
* Collaborate on architecture decisions for next-generation scale-out AI infrastructure
* Participate in design reviews, code reviews, and technical documentation
About the team
Utility Computing (UC)
Diverse Experiences
AWS values diverse experiences. Even if you do not meet all of the qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying.
About AWS
Inclusive Team Culture
Work/Life Balance
Mentorship & Career Growth
We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional.
- 3+ years of non-internship professional software development experience
- 2+ years of non-internship experience in design or architecture (design patterns, reliability, and scaling) of new and existing systems
- Strong programming skills in C/C++ with focus on high-performance systems
- Experience with RDMA technologies and RoCE implementations
- Familiarity with collective communication libraries (NCCL, RCCL, OneCCL, MPI)
- Experience with Linux networking, kernel development, and distributed systems
- Understanding of high-performance computing clusters and parallel programming
- 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- Bachelor's degree in computer science or equivalent
- Experience with SmartNIC programming and network acceleration hardware APIs
- Knowledge of large-scale AI training infrastructure and multi-rack cluster networking
- Experience with performance optimization, benchmarking, and system-level debugging
- Understanding of AI accelerator architectures and scale-out communication patterns
- Experience with cloud infrastructure integration and virtualization technologies
- Bachelor's degree in Computer Science, Computer Engineering, or related field
- Strong problem-solving skills and experience with complex distributed systems
- Proficiency in design and analysis of algorithms and data structures
- Linux operating system knowledge
- In-depth knowledge of TCP/IP
- Kernel or embedded development, particularly Linux kernel
- Strong knowledge of Computer Science fundamentals in data structures, algorithm design, problem solving, and complexity analysis
- Knowledge of at least one modern programming language such as C, C++, Rust, Python, or Perl
- Experience developing complex software systems that have been successfully delivered to customers
- Knowledge of professional software engineering practices & best practices for the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- Ability to take a project from scoping requirements through actual launch of the project
- Experience in communicating with users, other technical teams, and management to collect requirements, describe software product features, and technical designs
- Experience mentoring junior software development engineers and driving engineering excellence