Expoint – all jobs in one place

NVIDIA Senior DGX Cloud AI Infrastructure Software Engineer
China, Shanghai 
518492987

02.07.2025
Time type: Full time
Posted on: Today

What you’ll be doing:

  • Develop infrastructure software and tools for large-scale AI, LLM, and GenAI infrastructure.

  • Develop and optimize tools to improve infrastructure efficiency and resiliency.

  • Root-cause, analyze, and triage failures from the application level to the hardware level.

  • Enhance infrastructure and products underpinning NVIDIA's AI platforms.

  • Co-design and implement APIs for integration with NVIDIA's resiliency stacks.

  • Define meaningful and actionable reliability metrics to track and improve system and service reliability.

  • Apply strong problem-solving, root cause analysis, and optimization skills.

What we need to see:

  • 8+ years of experience developing software infrastructure for large-scale AI systems.

  • Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).

  • Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level.

  • Proven track record in building and scaling large-scale distributed systems.

  • Experience with AI training, inference, and data infrastructure services.

  • Familiarity with operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).

  • Proficiency in programming languages such as Python and C/C++, as well as scripting languages.

  • Excellent communication and collaboration skills; a culture of diversity, intellectual curiosity, problem solving, and openness is essential.

Ways to stand out from the crowd:

  • Experience working with large-scale AI clusters.

  • Strong understanding of NVIDIA GPUs and network technologies (RDMA, IB, NCCL).

  • Good understanding of the internals of DL frameworks such as PyTorch, TensorFlow, JAX, and Ray.

  • Experience with root cause analysis of failures at datacenter scale.

  • Strong background in software design and development.