Expoint - all jobs in one place

Amazon: Principal Network Development Engineer, ML Networking
Ireland, Dublin 
248298954

18.05.2025
DESCRIPTION

Key job responsibilities
Over the next 12-18 months, the person in this role will need to transform how we approach ML networking. This starts with developing new ways to identify and classify network traffic patterns from ML training, and with building systems that can automatically tune network configurations based on observed workload characteristics. They'll architect flexible abstractions that allow us to quickly adapt to new ML training patterns while maintaining peak performance for existing workloads.
The role requires someone who can move from theoretical understanding to practical implementation. They'll need to deliver a production-grade telemetry system that provides actionable insights about network performance, develop new approaches to baseline measurements, and demonstrate concrete performance improvements for key ML workloads. Success in this role means not just solving today's performance challenges, but building systems flexible enough to handle tomorrow's ML innovations.
This PE will be the technical authority for ML networking performance at AWS, working across teams to drive adoption of their approaches and establishing best practices that will shape how we build and operate our ML infrastructure for years to come.

BASIC QUALIFICATIONS

· A Master's degree in Computer Science or Engineering, or equivalent experience, is mandatory.
· Excellent IP networking fundamentals and extensive experience in the application of IP protocols.
· Expertise with major internet routing protocols; specifically, BGP, OSPF, MPLS, RSVP and IS-IS.
· Expertise with major router platforms; specifically, a deep technical understanding of all internal hardware components and experience with router system design.
· Expert-level network analysis fundamentals and robust troubleshooting skills; specifically, network performance analysis.
· Ability to lead teams of engineers to deliver large-scale solutions.
· Excellent written and verbal communication skills and an ability to interact efficiently with peers and customers are required.


PREFERRED QUALIFICATIONS

• Deep expertise in RDMA technologies (RoCEv2, EFA, InfiniBand)
• Strong understanding of ML training patterns and NCCL internals
• Experience with large-scale performance measurement systems
• Knowledge of ML frameworks and their distributed training implementations
• Expertise in network protocol design and optimization