Expoint - all jobs in one place

Nvidia Manager Site Reliability Engineer 
United States, Texas 
18663479

12.08.2024

What you will be doing:

  • Develop a team of SREs, providing mentorship, guidance, and support in achieving team goals.

  • Nurture a culture of collaboration, innovation, and continuous improvement within the SRE team.

  • Your team will be responsible for supporting and working on groundbreaking Generative AI inferencing workloads running in a globally distributed heterogeneous environment spanning 60+ edge locations plus all major cloud service providers. Ensure the best possible performance and availability on current and next-generation GPU architectures.

  • Collaborate closely with the service owners, architecture, research, and tools teams at NVIDIA to achieve ideal results for AI problems at hand.

  • Be part of an on-call rotation, monitoring and supporting critical high-performance, large-scale services running across multiple clouds.

  • Communicate and report service KPIs, priorities, and issues to leadership while driving premier incident responses.

  • Work closely with security teams to ensure the implementation of security best practices and compliance with relevant standards and regulations.

What we need to see:

  • MS or PhD in an engineering or computer science-related field, or equivalent experience.

  • 8+ years of overall experience operating and owning the end-to-end availability and performance of critical services in a live-site production environment, either as an SRE or Service Owner.

  • 6+ years of technical leadership beyond development that includes scoping, requirements gathering, leading, and influencing multiple teams of engineers on broad development initiatives.

  • Experience leading an engineering team on projects with technical deep dives into cloud technologies (AWS/Azure/GCP/OCI), code, networking, operating systems, storage, etc.

  • Solid understanding of containerization and microservices architecture. Excellent knowledge of the Kubernetes (K8s) ecosystem and standard methodologies for working with it.

  • Experience leading significant production activities, including change management, post-mortem reviews, workflow processes, software design, and delivering software automation in various languages (Python or Golang) and technologies (CI/CD, auto-remediation, alert correlation).

  • Strong understanding of SLOs/SLIs, error budgeting, and KPIs, and of configuring them for highly complex services.

Ways to stand out from the crowd:

  • Exposure to containerization and cloud-based deployments for AI models.

  • Excellent coding skills in Python, Go, or a similar language.

  • Prior experience driving production issues and helping with on-call support.

  • Understanding of Deep Learning / Machine Learning / AI.

  • Experience with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton.

You will also be eligible for equity.