Expoint - all jobs in one place

Nvidia Senior System Software Architect HPC AI Networking 
China, Beijing, Beijing 
309447720

Location: China, Beijing
Time type: Full time
Posted: 3 days ago

We are looking for a forward-thinking HPC and AI Inference Software Architect to help shape the future of scalable AI infrastructure, focusing on distributed training, real-time inference, and communication optimization across large-scale systems.


What you will be doing:

  • Design and prototype scalable software systems that optimize distributed AI training and inference—focusing on throughput, latency, and memory efficiency.

  • Develop and evaluate enhancements to communication libraries such as NCCL, UCX, and UCC, tailored to the unique demands of deep learning workloads.

  • Collaborate with AI framework teams (e.g., TensorFlow, PyTorch, JAX) to improve integration, performance, and reliability of communication backends.

  • Co-design hardware features (e.g., in GPUs, DPUs, or interconnects) that accelerate data movement and enable new capabilities for inference and model serving.

  • Contribute to the evolution of runtime systems, communication libraries, and AI-specific protocol layers.

  • Collaborate with customers to understand their needs and provide innovative solutions for them.

What we need to see:

  • Ph.D., Master's, or Bachelor's degree in computer science, computer engineering, electrical engineering, or a closely related field.

  • 5+ years of experience in DNNs, scaling of DNNs, parallelism in DNN frameworks, or deep learning training workloads.

  • Deep understanding of inference and training workloads and optimizations, such as prefill/decode, data parallelism, tensor parallelism, FSDP, etc.

  • Experience with AI network parallelism using collective libraries and RDMA/RoCE.

  • Background in algorithm design, system programming, and computer architecture.

  • Strong programming and software development skills.

  • Ability and flexibility to work and communicate effectively in a multi-national, multi-time-zone corporate environment.

Ways to stand out from the crowd:

  • Deep understanding of technology and passion for what you do.

  • Strong collaborative and interpersonal skills, specifically a proven ability to effectively guide and influence within a dynamic matrix environment.

  • Background in designing communication middleware for high-performance computing systems, including RoCE and DPUs.

  • Background in CUDA programming, NVIDIA GPUs, and programming models for emerging architectures.