Expoint - all jobs in one place

The place where the best experts and companies meet

NVIDIA SRE Manager, NIM Factory
United States, California 
145121127

18.08.2024

What you'll be doing:

  • This is a ground-floor opportunity to form a team and define the SRE role in the NIM program. Your team will operate a software factory that takes an AI model in and produces a deployable service validated across cloud, on-prem, and Kubernetes environments. With the development team, you will define and deliver rapid iterations on the group's technical strategies and roadmaps to evolve the NIM factory for continuous delivery of packaged NIMs. Your team is responsible for the operation of the factory, including its availability, observability, and stability. It will also track the deployment of our services into multiple cloud hosts and improve the efficiency, availability, and stability of these services.

  • You will partner with internal and external SRE team leadership to provide the best experience for our developers and for users of the resulting services. Your team ensures our operation is secure through proper configuration and management of infrastructure, including containers, databases, and networking, while following and improving standard processes for security, scalability, and cost optimization. This requires working closely with the security teams tasked with responding to security threats.

  • Broad collaboration with multiple AI model teams is needed to understand their requirements and build an efficient infrastructure that supports and improves development and production execution of these models. You will define metrics and drive improvements based on user feedback. You will mentor and collaborate throughout the team and with other teams to grow your colleagues and yourself. You will have a history of learning and of growing your own skills and those of the people around you.

What we need to see:

  • Supportive mentoring and empathetic leadership in recruiting and growing successful teams and team members. Flexibility and a clear ability to adjust your direction and expectations to the needs of our customers.

  • Effective experience working with multi-functional teams, principals and architects, and across organizational boundaries.

  • Demonstrated advanced system engineering skills in operating and improving the observability, security, and maintainability of distributed microservice cloud applications and services. Experience operating distributed containerized applications using technologies such as Docker, K8s, Cloud Endpoints, Helm, and Prometheus. Use of infrastructure as code, such as Terraform, Puppet, Ansible, or others.

  • Experience identifying the root cause of failures and performance bottlenecks in distributed microservices or cloud systems. Understanding and application of good security practices for public-facing cloud services.

  • BS or MS in Computer Science, Computer Engineering or equivalent experience.

  • 7+ years of overall experience as an SRE or Developer working on high-performance microservices and cloud software; 3+ years leading or managing engineering teams.

Ways to stand out from the crowd:

  • Excellent communication and interpersonal skills and the ability to engage a multi-functional team.

  • Experience with cloud-deployed infrastructure, effective security practices in high-risk environments, and hyperscaling applications to meet demand.

  • A history of building and deploying containers for microservices, cloud, and on-prem deployments, and their associated CI/CD pipelines.

You will also be eligible for equity.