Nvidia Senior Site Reliability Engineer 
India, Uttarakhand, Dehradun 
301553854

28.07.2025
India, Remote
Time type: Full time
Posted on: 6 days ago

What you'll be doing:

You will play a crucial role in the success of the Omniverse on DGX Cloud platform by helping to build our deployment infrastructure processes, creating world-class SRE measurements and automation tools that improve operational efficiency, and maintaining a high standard of service operability and reliability.

  • Design, build, and implement scalable cloud-based systems for PaaS/IaaS.

  • Work closely with other teams on new products or features/improvements of existing products.

  • Develop, maintain and improve cloud deployment of our software.

  • Participate in the triage and resolution of complex infra-related issues.

  • Collaborate with developers, QA, and Product teams to establish, refine, and streamline our software release process and software observability to ensure service operability, reliability, and availability.

  • Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.

  • Develop, maintain, and improve automation tools that improve the efficiency of SRE operations.

  • Practice balanced incident response and blameless postmortems.

  • Be part of an on-call rotation to support production systems.

What we need to see:

  • BS or MS in Computer Science or equivalent program from an accredited University/College.

  • 8+ years of hands-on software engineering or equivalent experience.

  • Demonstrated understanding of cloud design in the areas of virtualization, global infrastructure, distributed systems, and security.

  • Expertise in Kubernetes (K8s) & KubeVirt and building RESTful web services.

  • Understanding of building agentic AI solutions, preferably with Nvidia open-source AI solutions.

  • Demonstrated working experience with SRE principles such as metrics emission for observability, and monitoring and alerting using logs, traces, and metrics.

  • Hands-on experience with Docker, containers, Infrastructure as Code (for example, Terraform), and CI/CD deployment.

  • Knowledge of working with CSPs, for example AWS (Fargate, EC2, IAM, ECR, EKS, Route 53, etc.) and Azure.

Ways to stand out from the crowd:

  • Expertise in technologies such as StackStorm, OpenStack, Red Hat OpenShift, and AI databases like Milvus.

  • A track record of solving complex problems with elegant solutions.

  • Prior experience with Go, Python, and React.

  • Demonstrated delivery of complex projects in previous roles.

  • Demonstrated ability to develop frontend applications using concepts such as SSA and RBAC.