Expoint – all jobs in one place

Nvidia Senior DGX Cloud AI Infrastructure Software Engineer
United States, Texas
291017378

20.05.2025

US, CA, Santa Clara

US, TX, Austin

US, WA, Redmond

US, OR, Remote

time type: Full time

posted on: 7 Days Ago

What you’ll be doing:

Develop infrastructure software and tools for large-scale AI, LLM, and GenAI infrastructure.
Develop and optimize tools to improve infrastructure efficiency and resiliency.
Root-cause, analyze, and triage failures from the application level to the hardware level.
Enhance infrastructure and products underpinning NVIDIA's AI platforms.
Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
Define meaningful and actionable reliability metrics to track and improve system and service reliability.
Apply strong problem-solving, root cause analysis, and optimization skills.

What we need to see:

8+ years of experience developing software infrastructure for large-scale AI systems.
Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level.
Proven track record in building and scaling large-scale distributed systems.
Experience with AI training, inference, and data infrastructure services.
Familiarity with operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
Proficiency in programming languages such as Python and C/C++, and in scripting languages.
Excellent communication and collaboration skills, and a culture of diversity, intellectual curiosity, problem solving, and openness are essential.

Ways to stand out from the crowd:

Experience working with large-scale AI clusters.
Strong understanding of NVIDIA GPUs and network technologies (RDMA, IB, NCCL).
Good understanding of the internals of DL frameworks such as PyTorch, TensorFlow, JAX, and Ray.
Experience with root cause analysis of failures at datacenter scale.
Strong background in software design and development.

You will also be eligible for equity and .
