Expoint – all jobs in one place
Finding the best job has never been easier

Senior Site Reliability Engineer jobs at Nvidia in India, Dehradun

Discover your perfect match with Expoint. Search for job opportunities as a Senior Site Reliability Engineer in India, Dehradun and join the network of leading companies in the high-tech industry, like Nvidia. Sign up now and find your dream job with Expoint.
3 jobs found
15.09.2025

Nvidia: Senior Site Reliability Engineer DGX Cloud (India, Uttarakhand, Dehradun)

Limitless High-tech career opportunities - Expoint
Description:
Location: India, Remote
Time type: Full time
Posted: 15 days ago

What you’ll be doing:

  • Build, implement and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale, real-time monitoring, logging and alerting

  • Define SLOs/SLIs, monitor error budgets, and streamline reporting

  • Support services before they launch through system creation consulting, developing software tools, platforms and frameworks, capacity management, and launch reviews

  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health

  • Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds

  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity

  • Lead triage and root-cause analysis of high-severity incidents

  • Practice balanced incident response and blameless postmortems

  • Participate in on-call rotation to support production services

What we need to see:

  • BS in Computer Science or related technical field, or equivalent experience

  • 10+ years of experience operating production services

  • Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture

  • Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)

  • Proficiency in at least one high-level programming language (e.g., Python, Go)

  • In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards

  • Proficient knowledge of SRE principles, encompassing SLOs, SLIs, error budgets, and incident handling

  • Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, etc.

Ways to stand out from the crowd:

  • Operating GPU-accelerated clusters with KubeVirt in production

  • Applying generative-AI techniques to reduce operational toil

  • Automating incident response with Shoreline or StackStorm

07.09.2025

Nvidia: Network Infrastructure Engineer (India, Uttarakhand, Dehradun)

Description:
Locations: India, Remote; India, Hyderabad; India, Pune; India, Mumbai; India, Gurugram
Time type: Full time
Posted: 5 days ago

What you will be doing:

  • Engage in 24/7 global shift rotations to provide remote support for network repairs and changes while collaborating across teams and updating customers on status and ticket information.

  • Drive operational improvements in change management and daily operations by following procedures.

  • Manage and operate large scale IP network technologies and infrastructures.

  • Utilise your skills in Peering and Datacenter interconnect technologies: PNI, Transit, Exchange, Passive DWDM, Wave circuits.

  • Monitor and support the network health of on-premises and cloud infrastructures.

  • Collaborate and develop workflow enhancements while documenting best practices.

What we need to see:

  • Deep knowledge and experience of TCP/IP, BGP, OSPF, MPLS, IS-IS, VXLAN, EVPN, QoS, GRE, IPsec, DNS, and MACsec.

  • Over 4 years of experience in network operations.

  • Skilled in network troubleshooting techniques and creative problem-solving.

  • Strong track record of alert response within defined SLAs and of incident management.

  • Experience with one or more of the following CSP environments: AWS, Azure, GCP, OCI.

  • Familiarity with Arista, Fortinet and Juniper.

  • Hands-on experience with contributing to tooling and automation for provisioning, monitoring, and managing complex network infrastructures.

  • Bachelor’s degree in Computer Science, related technical field, or equivalent experience.

  • Excellent verbal and written communication skills.


Ways to stand out from the crowd:

  • Working knowledge of Mellanox/Cumulus OS.

  • Working knowledge of InfiniBand technology.

  • Skilled in Unix/Linux system administration, with the ability to write and understand Python/Shell scripts to enhance productivity in hyperscale environments.

  • Familiarity with leveraging tools such as NetBox/Nautobot, Prometheus, Grafana, Panoptes to monitor and manage a global network.

  • Passionate about innovating and investing in groundbreaking technologies.

These jobs might be a good fit

27.07.2025

Nvidia: Senior Site Reliability Engineer (India, Uttarakhand, Dehradun)

Description:
Location: India, Remote
Time type: Full time
Posted: 6 days ago

What you'll be doing:

You will play a crucial role in ensuring the success of the Omniverse on DGX Cloud platform by helping to build our deployment infrastructure processes, creating world-class SRE measurements and automation tools that improve operational efficiency, and maintaining a high standard of service operability and reliability.

  • Design, build, and implement scalable cloud-based systems for PaaS/IaaS.

  • Work closely with other teams on new products or features/improvements of existing products.

  • Develop, maintain and improve cloud deployment of our software.

  • Participate in the triage & resolution of complex infra-related issues

  • Collaborate with developers, QA and Product teams to establish, refine and streamline our software release process and software observability to ensure service operability, reliability, and availability.

  • Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces

  • Develop, maintain and improve automation tools that can help improve efficiency of SRE operations

  • Practice balanced incident response and blameless postmortems

  • Be part of an on-call rotation to support production systems

What we need to see:

  • BS or MS in Computer Science or equivalent program from an accredited University/College.

  • 8+ years of hands-on software engineering or equivalent experience.

  • Demonstrate understanding of cloud design in the areas of virtualization and global infrastructure, distributed systems, and security.

  • Expertise in Kubernetes (K8s) & KubeVirt and building RESTful web services.

  • Understanding of building agentic AI solutions, preferably with Nvidia open-source AI solutions. Demonstrated working experience with SRE principles such as metrics emission for observability, monitoring, and alerting using logs, traces and metrics.

  • Hands-on experience working with Docker, containers, and Infrastructure as Code, such as Terraform deployments and CI/CD.

  • Knowledge of working with CSPs, for example AWS (Fargate, EC2, IAM, ECR, EKS, Route 53, etc.), Azure, etc.

Ways to stand out from the crowd:

  • Expertise in technologies such as StackStorm, OpenStack, Red Hat OpenShift, and AI databases like Milvus.

  • A track record of solving complex problems with elegant solutions.

  • Prior experience with Go, Python, and React.

  • Demonstrated delivery of complex projects in previous roles.

  • Demonstrated ability to develop frontend applications using concepts such as SSA and RBAC.
