Finding the best job has never been easier

Senior Site Reliability Engineer jobs at Nvidia in India, Dehradun

Discover your perfect match with Expoint. Search for job opportunities as a Senior Site Reliability Engineer in India, Dehradun and join the network of leading companies in the high tech industry, like Nvidia. Sign up now and find your dream job with Expoint.

3 jobs found

15.09.2025

Nvidia Senior Site Reliability Engineer DGX Cloud India, Uttarakhand, Dehradun

Description:

India, Remote

time type: Full time

posted on: 15 Days Ago

job requisition id

What you’ll be doing:

Build, implement, and support operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
Define SLOs/SLIs, monitor error budgets, and streamline reporting
Support services before they launch through system creation consulting, developing software tools, platforms and frameworks, capacity management, and launch reviews
Maintain services once they are live by measuring and monitoring availability, latency and overall system health
Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity
Lead triage and root-cause analysis of high-severity incidents
Practice balanced incident response and blameless postmortems
Participate in on-call rotation to support production services

What we need to see:

BS in Computer Science or related technical field, or equivalent experience
10+ years of experience operating production services
Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
Proficiency in at least one high-level programming language (e.g., Python, Go)
In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
Proficient knowledge of SRE principles, encompassing SLOs, SLIs, error budgets, and incident handling
Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, etc.

Ways to stand out from the crowd:

Operating GPU-accelerated clusters with KubeVirt in production
Applying generative-AI techniques to reduce operational toil
Automating incidents with Shoreline or StackStorm

These jobs might be a good fit

Nvidia Senior Site Reliability Engineer DGX Cloud United States, California

Nvidia Senior Site Reliability Engineer - DGX Cloud United States, Texas

Nvidia Senior Site Reliability Engineer BCM - DGX Cloud United States, Texas

07.09.2025

Nvidia Network Infrastructure Engineer India, Uttarakhand, Dehradun

Description:

time type: Full time

posted on: 5 Days Ago

job requisition id

What you will be doing

Engage in 24/7 global shift rotations to provide remote support for network repairs and changes while collaborating across teams and updating customers on status and ticket information.
Drive operational improvements in change management and daily operations by following procedures.
Manage and operate large scale IP network technologies and infrastructures.
Utilise your skills in Peering and Datacenter interconnect technologies: PNI, Transit, Exchange, Passive DWDM, Wave circuits.
Monitor and support the network health of on-premises and cloud infrastructures.
Collaborate and develop workflow enhancements while documenting best practices.

What we need to see

Deep knowledge and experience of TCP/IP, BGP, OSPF, MPLS, IS-IS, VXLAN, EVPN, QoS, GRE, IPsec, DNS, and MACsec.
Over 4 years of experience in network operations.
Skilled in network troubleshooting techniques and creative problem-solving.
Strong track record of alert response within defined SLAs and of incident management.
Experience with one or more of the following CSP environments: AWS, Azure, GCP, OCI.
Familiarity with Arista, Fortinet and Juniper.
Hands-on experience contributing to tooling and automation for provisioning, monitoring, and managing complex network infrastructures.
Bachelor’s degree in Computer Science, related technical field, or equivalent experience.
Excellent verbal and written communication skills.

Ways To Stand Out From The Crowd:

Working knowledge of Mellanox/Cumulus OS.
Working knowledge of InfiniBand technology.
Skilled in Unix/Linux system administration, with the ability to write and understand Python/Shell scripts to enhance productivity in hyperscale environments.
Familiarity with leveraging tools such as Netbox/Nautobot, Prometheus, Grafana, Panoptes to monitor and manage a global network.
Passionate about innovating and investing in groundbreaking technologies.

27.07.2025

Nvidia Senior Site Reliability Engineer India, Uttarakhand, Dehradun

Description:

India, Remote

time type: Full time

posted on: 6 Days Ago

job requisition id

What you'll be doing:

You will play a crucial role in ensuring the success of the Omniverse on DGX Cloud platform by helping to build our deployment infrastructure processes, creating world-class SRE measurement and automation tools to improve operational efficiency, and maintaining a high standard of service operability and reliability.

Design, build, and implement scalable cloud-based systems for PaaS/IaaS.
Work closely with other teams on new products or features/improvements of existing products.
Develop, maintain and improve cloud deployment of our software.
Participate in the triage and resolution of complex infrastructure-related issues.
Collaborate with developers, QA, and Product teams to establish, refine, and streamline our software release process and software observability, ensuring service operability, reliability, and availability.
Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
Develop, maintain, and improve automation tools that improve the efficiency of SRE operations.
Practice balanced incident response and blameless postmortems.
Be part of an on-call rotation to support production systems.

What we need to see:

BS or MS in Computer Science or equivalent program from an accredited University/College.
8+ years of hands-on software engineering or equivalent experience.
Demonstrate understanding of cloud design in the areas of virtualization and global infrastructure, distributed systems, and security.
Expertise in Kubernetes (K8s) & KubeVirt and building RESTful web services.
Understanding of building agentic AI solutions, preferably with Nvidia open-source AI solutions.
Demonstrated working experience with SRE principles such as metrics emission for observability, monitoring, and alerting using logs, traces, and metrics.
Hands-on experience with Docker, containers, Infrastructure as Code (e.g., Terraform), deployment, and CI/CD.
Knowledge of working with CSPs, for example AWS (Fargate, EC2, IAM, ECR, EKS, Route 53, etc.), Azure, etc.

Ways to stand out from the crowd:

Expertise in technologies such as StackStorm, OpenStack, Red Hat OpenShift, and AI databases like Milvus.
A track record of solving complex problems with elegant solutions.
Prior experience with Go, Python, and React.
Demonstrated delivery of complex projects in previous roles.
Demonstrated ability to develop frontend applications using concepts such as SSA and RBAC.

Nvidia Senior Site Reliability Engineer DGX Cloud

India, Uttarakhand, Dehradun

242085963

15.09.2025
