What you will be doing:
Develop a team of SREs, providing mentorship, guidance, and support in achieving team goals.
Nurture a culture of collaboration, innovation, and continuous improvement within the SRE team.
Lead your team in supporting and working on groundbreaking Generative AI inference workloads running in a globally distributed, heterogeneous environment spanning 60+ edge locations and all major cloud service providers, ensuring the best possible performance and availability on current and next-generation GPU architectures.
Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for the AI problems at hand.
Participate in an on-call rotation, monitoring and supporting critical high-performance, large-scale services running across multiple clouds.
Communicate and report service KPIs, priorities, and issues to leadership while driving premier incident response.
Work closely with security teams to ensure the implementation of security best practices and compliance with relevant standards and regulations.
What we need to see:
MS or PhD in an engineering or computer science-related field, or equivalent experience.
8+ years of overall experience operating and owning end-to-end availability and performance of mission-critical services in a live-site production environment, either as an SRE or service owner.
6+ years of technical leadership beyond hands-on development, including scoping, requirements gathering, and leading and influencing multiple teams of engineers on broad development initiatives.
Experience leading an engineering team on projects involving technical deep dives into cloud technologies (AWS, Azure, GCP, OCI), code, networking, operating systems, storage, etc.
Solid understanding of containerization and microservices architecture, with excellent knowledge of Kubernetes, its ecosystem, and Kubernetes best practices.
Experience leading significant production activities, including change management, post-mortem reviews, workflow processes, and software design, and delivering software automation in various languages (Python or Go) and technologies (CI/CD auto-remediation, alert correlation).
Deep understanding of SLOs/SLIs, error budgets, and KPIs, and experience configuring them for highly complex services.
Ways to stand out from the crowd:
Exposure to containerization and cloud-based deployments for AI models.
Excellent coding skills in Python, Go, or a similar language.
Prior experience driving resolution of production issues and helping with on-call support.
Understanding of Deep Learning / Machine Learning / AI.
Experience with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton.
You will also be eligible for equity and benefits.