Expoint - all jobs in one place

NVIDIA Senior Site Reliability Engineer
India, Karnataka, Bengaluru 
365387779

24.06.2024

What you will be doing:

  • Support and work on groundbreaking Generative AI inferencing workloads running in a globally distributed, heterogeneous environment spanning 60+ edge locations and all major cloud service providers. Ensure the best possible performance and availability on current and next-generation GPU architectures.

  • Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for AI problems at hand.

  • Monitor and support critical, high-performance, large-scale services running across multiple clouds.

  • Participate in the triage and resolution of complex infrastructure-related issues.

  • Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.

  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.

  • Practice balanced incident response and blameless postmortems.

  • Be part of an on-call rotation to support production systems and lead significant production improvement around tooling, automation, and process.

  • Architect, design, and code using your expertise to optimize, deploy and productize services.

What we need to see:

  • 8+ years of experience operating and owning the end-to-end availability and performance of mission-critical services in a live-site production environment, as either an SRE or a Service Owner.

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.

  • Solid understanding of containerization and microservices architecture, and excellent understanding of the Kubernetes (K8s) ecosystem and its best practices.

  • Ability to dissect complex problems into simple sub-problems and use available solutions to resolve them.

  • Technical leadership beyond development, including scoping, requirements capture, and leading and influencing multiple teams of engineers on broad development initiatives.

  • Experience leading significant production activities, including change management, post-mortem reviews, workflow processes, software design, and delivering software automation in various languages (Python or Go) and technologies (CI/CD auto-remediation, alert correlation).

  • Strong understanding of SLOs/SLIs, error budgeting, and KPIs, and of configuring them for highly complex services.

  • Experience with the ELK and Prometheus stacks as a power user and administrator.

  • Excellent understanding of cloud environments and technologies, especially AWS, Azure, GCP, or OCI.

  • Proven strengths in identifying, mitigating, and root-causing issues while continuously seeking ways to drive optimization, efficiency, and the bottom line.

Ways to stand out from the crowd:

  • Exposure to containerization and cloud-based deployments for AI models.

  • Excellent coding skills in Python, Go, or a similar language.

  • Prior experience driving the resolution of production issues and helping with on-call support, plus an understanding of Deep Learning / Machine Learning / AI.

  • Experience with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton, as well as with StackStorm and similar automation platforms.

  • Understanding of observability instrumentation techniques and best practices, including OpenTelemetry.