Expoint – all jobs in one place

NVIDIA AI Infrastructure System Software Manager
United States, California 
235489708

Yesterday
US, CA, Santa Clara
Time type: Full time
Posted on: 6 days ago

Looking for an AI Infrastructure System Software Manager ... continuously working to provide better tools to build and manage this ... system ... the ability to ... out long-term maintenance strategy.


What you'll be doing:

  • Mentor, grow, and develop a world-class team of AI infrastructure engineers.

  • Work across several teams and orgs to build products that use LLMs and agent systems to serve the needs of NVIDIA engineering teams. In that role, you will collaborate with research and infra teams and serve a large user base (hardware/software teams across NVIDIA).

  • Align priorities across collaborators and define metrics for measuring the success of the product/team.

  • Develop and execute strategies for scalable, reliable, and secure AI infrastructure supporting both research and production workloads.

  • Ensure robust monitoring, logging, visualization, and alerting capabilities to guarantee promised uptime and operational excellence.

  • Architect, design, develop, and maintain infrastructure and large-scale applications for LLM-based solutions. Optimize these systems for performance, scalability, reliability, and secure data management.

  • Stay updated with the latest trends in AI, ML, and infrastructure, proactively seeking opportunities to integrate advancements into Nvidia’s LLM and AI infrastructure solutions.

What we need to see:

  • 10+ years of overall industry experience developing software for large distributed systems.

  • BS or higher degree in CS or a related field, or equivalent experience.

  • 5+ years of experience managing AI and SW development teams.

  • Familiarity with modern software development stacks and tools, including containerization, cloud or on-premises deployments, API integration for seamless model operation, and real-time processing frameworks.

  • Experience in developing and maintaining LLM or GenAI infrastructure

  • Excellent communication, collaboration, and problem-solving skills, with a dedication to encouraging an inclusive and diverse workplace.

  • Hands-on experience developing large-scale distributed systems

Ways to stand out from the crowd:

  • Strong technical background in cloud/distributed infrastructure

  • Experience debugging functional and performance issues in HPC GPU clusters

  • Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster

  • Experience with HPC schedulers such as Slurm

You will also be eligible for equity and benefits.