Nvidia Senior Site Reliability Engineer DGX Cloud
India, Uttarakhand, Dehradun
242085963

16.09.2025

India, Remote

What you’ll be doing:

Build, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
Define SLOs/SLIs, monitor error budgets, and streamline reporting
Support services before they launch through system creation consulting, developing software tools, platforms and frameworks, capacity management, and launch reviews
Maintain services once they are live by measuring and monitoring availability, latency and overall system health
Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity
Lead triage and root-cause analysis of high-severity incidents
Practice balanced incident response and blameless postmortems
Participate in an on-call rotation to support production services

What we need to see:

BS in Computer Science or related technical field, or equivalent experience
10+ years of experience operating production services
Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
Proficiency in at least one high-level programming language (e.g., Python, Go)
In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling
Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, etc.

Ways to stand out from the crowd:
