Nvidia Senior Site Reliability Engineer - GeForce
United States, Texas
338129888

12.08.2024

What you will be doing:

Support and work on groundbreaking Generative AI inferencing and training workloads running in a globally-distributed heterogeneous environment that spans all major cloud service providers. Ensure the best possible performance and availability on current and next-generation GPU architectures.
Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for AI problems at hand.
Monitor and support critical, high-performance, large-scale services running across multiple clouds.
Participate in the triage & resolution of sophisticated infra-related issues.
Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
Practice balanced incident response and blameless postmortems.
Be part of an on-call rotation to support production systems.
Lead significant production improvements around tooling, automation, and process.
Architect, design, and code using your expertise to optimize, deploy and productize services.

What we need to see:

8+ years of demonstrated experience operating & owning end-to-end availability and performance of critically important services in a live-site production environment, either as an SRE or Service Owner.
3+ years of incident management experience, including participating in an on-call rotation to support production services.
Bachelor's degree or equivalent experience.
Experience configuring and administering AWS infrastructure and environments.
Proven understanding of containerization and microservices architecture, and an excellent understanding of the Kubernetes (K8s) ecosystem and its standard methodologies.
Ability to dissect sophisticated problems into simple sub-problems and use available solutions to resolve them.
Strong understanding of SLOs/SLIs, error budgeting, and KPIs, and of configuring them for highly sophisticated services.
Experience with the ELK and Prometheus stacks as a power user and administrator.
Excellent understanding of cloud environments and technologies, especially AWS, Azure, GCP, or OCI.
Validated strengths in identifying, mitigating, and root-causing issues while continuously seeking ways to drive optimization, efficiency, and the bottom line.

Ways to stand out from the crowd:

Exposure to containerization and cloud-based deployments for AI models.
Excellent coding skills in Python, Go, or a similar language.
Understanding of Deep Learning / Machine Learning / AI.
Experience with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton.
Excellent communication, presentation, social, and analytical skills; the ability to communicate complex concepts clearly and persuasively across different audiences and varying levels of the organization.

You will also be eligible for equity and .
