Expoint – all jobs in one place

Nvidia Senior Systems Engineer – High-Performance AI Networking Applications 
United States, Texas 
333984182

Today
Locations: US, CA, Santa Clara; US, WA, Remote; US, CA, Remote
Time type: Full time
Posted: 6 days ago

What you will be doing:

  • Collaborate with networking teams to plan, implement, and evaluate performance benchmarks on NVLink-, NVSwitch-, and InfiniBand-powered infrastructures.

  • Assess findings and work closely with framework, hardware, and support teams to improve system performance across various deep learning workloads.

  • Act as a primary resource for fixing networking and hardware integration issues, focusing on scalable multi-node systems.

  • Maintain high communication standards across multiple engineering, support, and R&D teams, ensuring technical and performance goals are met.

  • Offer technical mentorship and documentation for internal teams and external partners on standard methodologies in HPC networking deployments.

  • Share insights on improving networking strategies for large-scale AI and deep learning infrastructure.

What we need to see:

  • BS, MS, or PhD in Computer Science, Engineering, or a related field, or equivalent experience.

  • 8+ years of proven experience in AI/HPC Infrastructure.

  • Familiarity with AI/HPC job schedulers and orchestrators like Slurm, K8s, or LSF. Practical exposure to AI/HPC workflows employing MPI and NCCL.

  • Familiarity with high-speed networking for HPC, including InfiniBand, RDMA, RoCE, and Amazon EFA.

  • Understanding of PyTorch, Megatron-LM, and deep learning inference frameworks such as vLLM and SGLang.

  • Proven experience with InfiniBand, NVLink, and high-speed networking technologies in HPC or large-scale datacenter environments.

  • Experience investigating and evaluating performance in multi-node systems, especially for deep learning or scientific computing workloads.

  • Strong analytical, debugging, and technical communication skills.

  • Comfortable working in collaborative, multi-faceted teams.

Ways to stand out from the crowd:

  • Mastery of deep learning frameworks or distributed training systems.

  • Familiarity with datacenter automation, advanced network protocols, and supporting large HPC or AI clusters in production environments.

  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.

  • Experience with networking and communications libraries like NCCL, NIXL, NVSHMEM, UCX.

  • Experience developing or maintaining cluster management and monitoring tools, for example Ansible for infrastructure as a service, and Prometheus and Grafana for monitoring.

You will also be eligible for equity and benefits.