
Nvidia Senior Site Reliability Engineer DGX Cloud 
United States, California 
123189682

31.08.2025
US, CA, Remote
Time type: Full time
Posted: 2 days ago

What you’ll be doing:

  • Support large-scale Kubernetes services before they launch through system creation consulting; development of software tools, platforms, and frameworks; capacity management; and launch reviews

  • Build, implement, and support operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting

  • Define SLOs/SLIs, monitor error budgets, and streamline reporting

  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health

  • Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds

  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity

  • Lead triage and root-cause analysis of high-severity incidents

  • Practice balanced incident response and blameless postmortems

  • Participate in on-call rotation to support production services
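
For readers unfamiliar with the SLO and error-budget terms in the list above, the arithmetic can be sketched as follows. This is a minimal illustration, not part of the role description: the function name `error_budget`, the request-based availability SLI, and the sample numbers are all assumptions chosen for the example.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize a request-based availability SLO (hypothetical helper).

    slo_target      -- target fraction of successful requests, e.g. 0.999
    total_requests  -- requests served in the measurement window
    failed_requests -- requests that failed in the same window
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the measured fraction of successful requests.
    sli = 1.0 - failed_requests / total_requests
    # Error budget: the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Fraction of the budget already spent (above 1.0 means the SLO is breached).
    consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,
        "slo_met": sli >= slo_target,
    }


# Illustrative numbers only: a 99.9% target over one million requests.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
```

With these sample inputs the SLO tolerates roughly 1,000 failed requests, so 250 failures spend about a quarter of the budget.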

What we need to see:

  • BS in Computer Science or related technical field, or equivalent experience

  • 12+ years of experience operating production services at scale

  • Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture, with deep experience in Kubernetes operators and distributed systems at scale

  • Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)

  • Proficiency in at least one high-level programming language (e.g., Python, Go)

  • In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards

  • Demonstrated ability to troubleshoot complex DNS, network, Kubernetes, and systems issues in production environments

  • Proficient knowledge of SRE principles, encompassing SLOs, SLIs, error budgets, and incident handling

  • Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, Datadog, etc.

Ways to stand out from the crowd:

  • Operating GPU-accelerated clusters with KubeVirt in production

  • Applying generative-AI techniques to reduce operational toil

  • Automating incidents with Shoreline or StackStorm

  • GPU workload orchestration and large-scale GPU resource management

You will also be eligible for equity and .