We are looking for an AI Infrastructure System Software Manager. We are continuously working to provide better tools to build and manage this system, and the role calls for the ability to work out a long-term maintenance strategy.
What you'll be doing:
Mentor, grow, and develop a world-class team of AI infrastructure engineers.
Work across several teams and orgs to build products that use LLMs and agent systems to serve the needs of NVIDIA engineering teams. In that role, you will collaborate with research and infra teams and serve a large user base (hardware and software teams across NVIDIA).
Align priorities across collaborators and define metrics for measuring the success of the product/team.
Develop and execute strategies for scalable, reliable, and secure AI infrastructure supporting both research and production workloads.
Ensure robust monitoring, logging, visualization, and alerting capabilities to guarantee promised uptime and operational excellence.
Architect, design, develop, and maintain infrastructure and large-scale applications for LLM-based solutions. Optimize these systems for performance, scalability, reliability, and secure data management.
Stay updated with the latest trends in AI, ML, and infrastructure, proactively seeking opportunities to integrate advancements into NVIDIA’s LLM and AI infrastructure solutions.
What we need to see:
10+ years of overall industry experience developing software for large distributed systems.
BS or higher degree in CS or a related field, or equivalent experience.
5+ years of experience managing AI and SW development teams.
Familiarity with modern software development stacks and tools, including containerization, cloud or on-premises deployments, API integration for seamless model operation, and real-time processing frameworks.
Experience in developing and maintaining LLM or GenAI infrastructure.
Excellent communication, collaboration, and problem-solving skills, with a dedication to encouraging an inclusive and diverse workplace.
Hands-on experience developing large-scale distributed systems
Ways to stand out from the crowd:
Strong technical background in cloud/distributed infrastructure
Experience debugging functional and performance issues in HPC GPU clusters
Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster
Experience with HPC schedulers such as Slurm
You will also be eligible for equity and benefits.