What you’ll be doing:
Develop infrastructure software and tools for large-scale AI, LLM, and GenAI infrastructure.
Develop and optimize tools to improve infrastructure efficiency and resiliency.
Root-cause, analyze, and triage failures from the application level to the hardware level.
Enhance infrastructure and products underpinning NVIDIA's AI platforms.
Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
Define meaningful and actionable reliability metrics to track and improve system and service reliability.
Apply strong problem-solving, root cause analysis, and optimization skills.
What we need to see:
8+ years of experience developing software infrastructure for large-scale AI systems.
Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level.
Proven track record in building and scaling large-scale distributed systems.
Experience with AI training, inference, and data infrastructure services.
Familiarity with operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
Proficiency in programming languages such as Python and C/C++, as well as scripting languages.
Excellent communication and collaboration skills; a culture of diversity, intellectual curiosity, problem solving, and openness is essential.
Ways to stand out from the crowd:
Experience working with large-scale AI clusters.
Strong understanding of NVIDIA GPUs and networking technologies (RDMA, InfiniBand, NCCL).
Good understanding of the internals of DL frameworks such as PyTorch, TensorFlow, JAX, and Ray.
Experience with root cause analysis of failures at datacenter scale.
Strong background in software design and development.
You will also be eligible for equity and benefits.