It offers an enterprise-grade, full-stack solution, optimized for large-scale AI training and inference, and is available through partnerships with leading cloud service providers.
What you'll be doing:
Making the existing cluster automation platform more fault-tolerant, agile, hardware/networking aware, and resource-efficient
Enabling AI capabilities in the platform to enhance user experience and accelerate automation, diagnosis, and remediation of issues
Integrating with the ecosystem tools to enable a rich, unified user experience with full end-to-end capabilities
Collaborating with various stakeholders across NVIDIA to understand business context, influence the product roadmap, help with adoption of the automation platform, and reduce toil for managing clusters
Operating critical software services with high availability and reliability
Programming in systems languages like Rust and Go
Driving engineering best practices, mentoring engineers, and fostering an inclusive team culture
What we need to see:
Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience)
Keen interest in driving Agent AI projects
10 years of equivalent experience
Demonstrated ability in building scalable, agile, and robust distributed systems
Successful product rollouts and collaboration with early adopters
Technical leadership and ownership of projects across the organization
Hands-on approach, passion for continuous improvement, and willingness to get involved in all aspects of development
Experience working with ambiguity and driving clarity in complex technical decisions
Ways to stand out from the crowd:
Skilled in using AI to scale team productivity and agility
Experience with revamping complex systems with existing customers to take them to the next level
Experience with SRE, DevOps, CI/CD, and a variety of platforms
You will also be eligible for equity.