Expoint - all jobs in one place

NVIDIA Senior Site Reliability Engineer - GeForce
United States, Texas 
338129888

12.08.2024

What you will be doing:

  • Support and work on groundbreaking Generative AI inferencing and training workloads running in a globally-distributed heterogeneous environment that spans all major cloud service providers. Ensure the best possible performance and availability on current and next-generation GPU architectures.

  • Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for AI problems at hand.

  • Monitor and support critical, high-performance, large-scale services running across multiple clouds.

  • Participate in the triage & resolution of sophisticated infra-related issues.

  • Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.

  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.

  • Practice balanced incident response and blameless postmortems.

  • Be part of an on-call rotation to support production systems.

  • Lead significant production improvements around tooling, automation, and process.

  • Architect, design, and code using your expertise to optimize, deploy and productize services.

What we need to see:

  • 8+ years of demonstrated experience operating & owning end-to-end availability and performance of critically important services in a live-site production environment, either as an SRE or Service Owner.

  • 3+ years of incident management experience, including participating in an on-call rotation to support production services.

  • Bachelor's degree or equivalent experience.

  • Experience configuring and administering AWS infrastructure and environments.

  • Proven understanding of containerization and microservices architecture, and excellent understanding of the Kubernetes (K8s) ecosystem and its standard methodologies.

  • Ability to dissect sophisticated problems into simple sub-problems and use available solutions to resolve them.

  • Strong understanding of SLOs/SLIs, error budgeting, and KPIs, and experience configuring them for highly sophisticated services.

  • Experience with the ELK and Prometheus stacks as a power user and administrator.

  • Excellent understanding of cloud environments and technologies, especially AWS, Azure, GCP, or OCI.

  • Validated strengths in identifying, mitigating, and root-causing issues while continuously seeking ways to drive optimization, efficiency, and the bottom line.

Ways to stand out from the crowd:

  • Exposure to containerization and cloud-based deployments for AI models.

  • Excellent coding skills in Python, Go, or a similar language.

  • Understanding of Deep Learning / Machine Learning / AI.

  • Experience with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton.

  • Excellent communication, presentation, social, and analytical skills; the ability to communicate complex concepts clearly and persuasively across different audiences and varying levels of the organization.

You will also be eligible for equity.