Microsoft Principal Network Architect 
Taiwan, Taoyuan City 
Job ID: 517811832
Posted: 16.10.2025

Required/Minimum Qualifications

  • Master's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 9+ years technical engineering experience OR Bachelor's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 11+ years technical engineering experience OR equivalent experience.
  • 10+ years designing and operating large-scale L2/L3 Ethernet fabrics for HPC/AI or hyperscale services.
  • 5+ years of experience with Ethernet, RDMA/RoCEv2, congestion control (ECN/PFC, DCQCN, HPCC, TIMELY), routing (BGP/ECMP, IS-IS/OSPF), and load balancing (CONGA/HULA/PLB).
  • 5+ years of experience with switch/NIC architecture (ASIC pipelines, queueing/scheduling, buffers, telemetry, hash/ECMP behaviors) and optics (DR/FR/LR, PAM-4, FEC).
  • 5+ years of experience with traffic generation and analysis (Ixia/Keysight, TRex, pktgen, iperf, Perfetto), switch/NIC telemetry, and packet capture (INT, ERSPAN, SPAN, pcaps); a minimal measurement sketch follows this list.
  • 3+ years of experience managing engineers (hiring, mentoring, performance management, org health).
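For illustration only, here is a minimal Python sketch of the kind of reproducible throughput measurement implied by the traffic-generation bullet above. It assumes iperf3 is installed, an iperf3 server is reachable at a placeholder address, and it reads fields from iperf3's JSON output; it is a sketch, not a reference test harness.

    # Minimal sketch: run a parallel iperf3 TCP test and report aggregate throughput.
    # Assumptions: iperf3 in PATH, an iperf3 server listening at SERVER (placeholder).
    import json
    import subprocess

    SERVER = "10.0.0.2"  # hypothetical lab endpoint

    def measure_throughput_gbps(server: str, seconds: int = 10, streams: int = 8) -> float:
        """Return aggregate receive throughput in Gbps from iperf3's JSON summary."""
        out = subprocess.run(
            ["iperf3", "-c", server, "-t", str(seconds), "-P", str(streams), "-J"],
            capture_output=True, text=True, check=True,
        )
        result = json.loads(out.stdout)
        return result["end"]["sum_received"]["bits_per_second"] / 1e9

    if __name__ == "__main__":
        print(f"aggregate throughput: {measure_throughput_gbps(SERVER):.2f} Gbps")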

Other Requirements:

  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred Qualifications

  • Experience optimizing networks for AI collectives (all-reduce, all-gather, expert routing) and distributed training systems.
  • Familiarity with programmable data planes (P4, eBPF/XDP), in-network telemetry/compute, and NIC offloads (GRO/TSO/LRO, DPDK).
  • Depth in buffer management and queue disciplines (DWRR, WFQ, Deficit Round Robin, QCN, VOQ) and QoS for multi-tenant clusters.
  • Experience with optic/PHY roadmaps (800G/1.6T, linear pluggables, CPO/LPO, FEC trade-offs) and DC power/cooling constraints affecting network design.
  • Contributions to standards bodies/consortia (drafts, presentations) and vendor co-development.
  • Proven track record shipping production network designs with measurable latency/throughput improvements and reliability gains.
  • Proficiency in Python/Go and automation frameworks (Ansible/Terraform) for test, measurement, and CI.
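As an illustrative companion to the automation/CI bullet above, here is a pytest-style gate that fails a pipeline when a measured tail-latency figure exceeds a placeholder SLO. The results.json path, the p99_rtt_us field, and the 50 µs threshold are assumptions for the sketch, not targets from this role.

    # Minimal CI-gate sketch (run with pytest). Assumes an earlier measurement
    # stage wrote results.json containing a "p99_rtt_us" field; the SLO value
    # below is a placeholder, not a real target.
    import json
    from pathlib import Path

    P99_RTT_SLO_US = 50.0           # hypothetical tail-latency SLO
    RESULTS = Path("results.json")  # hypothetical output of the measurement stage

    def test_p99_rtt_meets_slo():
        """Fail the pipeline if the measured p99 RTT exceeds the SLO."""
        data = json.loads(RESULTS.read_text())
        assert data["p99_rtt_us"] <= P99_RTT_SLO_US, (
            f"p99 RTT {data['p99_rtt_us']} us exceeds SLO of {P99_RTT_SLO_US} us"
        )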


Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here. Microsoft will accept applications for the role until October 24, 2025.


Responsibilities
  • Own end-to-end network architecture for AI training/inference clusters: topology, routing, transport, congestion control, QoS, telemetry, reliability, and failure domains.
  • Lead and grow a high-performing team (~10 engineers) across architecture, performance, and validation; set goals, mentor, and drive execution.
  • Define scale-out/scale-up designs (e.g., leaf-spine, dragonfly/dragonfly+, Clos/fat-tree, 2D/3D torus variants) and network services for job schedulers and accelerator runtimes.
  • Drive congestion-control strategy (ECN/PFC, DCQCN, HPCC, TIMELY, HULL, adaptive load balancing like CONGA/HULA) and transport tuning (RDMA/RoCEv2, QUIC/TCP variants) for tail-latency and throughput SLAs.
  • Hands-on analysis of switch/NIC behavior using counters, traces, and telemetry (PFC/ECN stats, INT, in-band telemetry, gNMI/gNOI, sFlow/NetFlow, eBPF); create reproducible perf tests (see the counter-collection sketch after this list).
  • Evaluate and influence silicon & optics (ASIC feature roadmaps, queueing/scheduling, packet recirculation, shared buffer, VOQs, cut-through vs store-and-forward, 400/800G, linear vs retimed optics).
  • Prototype and validate in lab and pre-prod: build testbeds, craft microbenchmarks and realistic AI workloads; automate with Python/Go/Ansible; codify SLOs and pass/fail gates.
  • Partner across teams (accelerator/HBM, storage, orchestration, reliability) to co-design network-aware collective ops (all-reduce/all-to-all/mixture-of-experts) and placement policies.
  • Influence standards and industry direction via active participation in IEEE 802.3/802.1, IETF, OCP, OIF, Ethernet Alliance, and vendor ecosystems; drive MSFT requirements into roadmaps.
  • Operational excellence: define observability, fault isolation, failure testing (Jepsen-style chaos, link flap/black-hole, incast), capacity planning, and upgrade/rollout strategies.
  • Documentation & reviews: author design docs, RFCs, and executive briefs; run design and readiness reviews.
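As a hedged illustration of the hands-on counter analysis and reproducible perf tests described above, here is a short Python sketch that snapshots NIC statistics via ethtool -S and reports movement in pause/ECN-related counters over an interval. Counter names vary by NIC vendor and driver, and the interface name is hypothetical, so the substring filter is illustrative only.

    # Sketch: diff pause/ECN-related NIC counters over an interval.
    # Assumptions: Linux host, ethtool in PATH, driver exposes such counters
    # via "ethtool -S"; names differ across vendors, so the filter is illustrative.
    import subprocess
    import time

    def read_nic_stats(iface: str) -> dict[str, int]:
        """Parse 'ethtool -S <iface>' into a {counter_name: value} dict."""
        out = subprocess.run(["ethtool", "-S", iface],
                             capture_output=True, text=True, check=True).stdout
        stats = {}
        for line in out.splitlines():
            name, _, value = line.partition(":")
            try:
                stats[name.strip()] = int(value.strip())
            except ValueError:
                pass  # skip headers and non-numeric fields
        return stats

    def pause_ecn_deltas(iface: str, interval_s: float = 5.0) -> dict[str, int]:
        """Report how far pause/ECN-related counters moved over an interval."""
        before = read_nic_stats(iface)
        time.sleep(interval_s)
        after = read_nic_stats(iface)
        return {name: after[name] - before[name]
                for name in after
                if name in before
                and ("pause" in name or "ecn" in name)
                and after[name] != before[name]}

    if __name__ == "__main__":
        print(pause_ecn_deltas("eth0"))  # "eth0" is a placeholder interface name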