Expoint - all jobs in one place

Nvidia Senior SRE Engineer NIM Factory 
United States, Texas 
255160090

18.08.2024

What you'll be doing:

  • Operate a software factory that takes an AI model in and produces a deployable service validated across Cloud, On-prem and Kubernetes environments. With the development team, define and deliver rapid iterations on the group's technical strategies and roadmaps to evolve the NIM factory for continuous delivery of packaged NIMs. You will be responsible for the operation of the factory, including its availability, observability, and stability. You will also track the deployment of our services into multiple cloud hosts and improve the efficiency, availability, and stability of these services.

  • Partner with internal and external SRE teams to provide the best experience for our developers and for the users of the resulting services. Your work ensures our operation is secure through the proper configuration and management of infrastructure, including containers, databases, and networking, while following and improving standard processes for security, scalability, and cost optimization. This requires working closely with our security teams tasked with responding to security threats.

  • Collaborate broadly with multiple AI model teams to understand their requirements and build an efficient infrastructure that supports and improves development and production execution of these models. You will define metrics and drive improvements based on user feedback. You will mentor and collaborate throughout the team and with other teams to grow your colleagues and yourself. You will have a history of learning and of growing both your own skills and those of the people around you.

What we need to see:

  • Demonstrated advanced system engineering skills operating and improving the observability and maintainability of distributed microservice cloud applications and services.

  • Experience working effectively with multi-functional teams, principals and architects, and across organizational boundaries.

  • Experience mentoring and growing teams and team members, and the flexibility to adjust your direction and expectations given the needs of our customers.

  • Experience operating distributed containerized applications using technologies such as Docker, K8s, Cloud Endpoints, Helm, and Prometheus. Use of infrastructure as code, such as Terraform, Puppet, Ansible or others.

  • Experience identifying the root cause of failures and performance bottlenecks in distributed microservices or cloud systems. Understanding and application of sound security practices for public-facing cloud services.

  • BS or MS in Computer Science, Computer Engineering or equivalent experience.

  • 7+ years of demonstrated experience as an SRE or Developer working on high-performance microservices and cloud software.

Ways to stand out from the crowd:

  • Excellent communication and interpersonal skills and the ability to engage a multi-functional team.

  • Experience with event-driven applications using various services such as Temporal, Kafka, Redis or others.

  • A history of building and deploying containers for microservices, cloud and on-prem deployments, and their associated CI/CD pipelines.

You will also be eligible for equity and .