Expoint – all jobs in one place
Finding a high-tech job at the best companies has never been easier

Senior Site Reliability Engineer jobs at Nvidia in India, Dehradun

Find your perfect match with Expoint! Search for job opportunities as a Senior Site Reliability Engineer in India, Dehradun, and join the network of leading companies in the high-tech industry, such as Nvidia. Sign up now and find your dream job with Expoint!
3 jobs found
15.09.2025

Nvidia Senior Site Reliability Engineer DGX Cloud India, Uttarakhand, Dehradun

Limitless High-tech career opportunities - Expoint
Description:
India, Remote
Time type: Full time
Posted on: 15 Days Ago
job requisition id

What you’ll be doing:

  • Build, implement and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale, real-time monitoring, logging and alerting

  • Define SLOs/SLIs, monitor error budgets, and streamline reporting

  • Support services before they launch through system creation consulting, developing software tools, platforms and frameworks, capacity management, and launch reviews

  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health

  • Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds

  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity

  • Lead triage and root-cause analysis of high-severity incidents

  • Practice balanced incident response and blameless postmortems

  • Participate in on-call rotation to support production services

What we need to see:

  • BS in Computer Science or related technical field, or equivalent experience

  • 10+ years of experience operating production services

  • Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture

  • Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)

  • Proficiency in at least one high-level programming language (e.g., Python, Go)

  • In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards

  • Proficient knowledge of SRE principles, encompassing SLOs, SLIs, error budgets, and incident handling

  • Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, etc.

Ways to stand out from the crowd:

  • Operating GPU-accelerated clusters with KubeVirt in production

  • Applying generative-AI techniques to reduce operational toil

  • Automating incidents with Shoreline or StackStorm

07.09.2025

Nvidia Network Infrastructure Engineer India, Uttarakhand, Dehradun

Description:
India, Remote
India, Hyderabad
India, Pune
India, Mumbai
India, Gurugram
Time type: Full time
Posted on: 5 Days Ago
job requisition id

What you will be doing

  • Engage in 24/7 global shift rotations to provide remote support for network repairs and changes while collaborating across teams and updating customers on status and ticket information.

  • Drive operational improvements in change management and daily operations by following procedures.

  • Manage and operate large scale IP network technologies and infrastructures.

  • Utilise your skills in Peering and Datacenter interconnect technologies: PNI, Transit, Exchange, Passive DWDM, Wave circuits.

  • Monitor and support the network health of on-premises and cloud infrastructures.

  • Collaborate and develop workflow enhancements while documenting best practices.

What we need to see

  • Deep knowledge and experience of TCP/IP, BGP, OSPF, MPLS, IS-IS, VXLAN, EVPN, QoS, GRE, IPsec, DNS, and MACsec.

  • Over 4 years of experience in network operations.

  • Skilled in network troubleshooting techniques and leveraging creative problem-solving abilities.

  • Strong track record of alert response within defined SLAs and of incident management.

  • Experience with one or more of the following CSP environments: AWS, Azure, GCP, OCI.

  • Familiarity with Arista, Fortinet and Juniper.

  • Hands-on experience contributing to tooling and automation for provisioning, monitoring, and managing complex network infrastructures.

  • Bachelor’s degree in Computer Science, related technical field, or equivalent experience.

  • Excellent verbal and written communication skills.


Ways To Stand Out From The Crowd:

  • Working knowledge of Mellanox/Cumulus OS.

  • Working knowledge of InfiniBand technology.

  • Skilled in Unix/Linux system administration, with the ability to write and understand Python/Shell scripts to enhance productivity in hyperscale environments.

  • Familiarity with tools such as NetBox/Nautobot, Prometheus, Grafana, and Panoptes for monitoring and managing a global network.

  • Passionate about innovating and investing in groundbreaking technologies.


More jobs that may interest you

27.07.2025

Nvidia Senior Site Reliability Engineer India, Uttarakhand, Dehradun

Description:
India, Remote
Time type: Full time
Posted on: 6 Days Ago
job requisition id

What you'll be doing:

You will play a crucial role in the success of the Omniverse on DGX Cloud platform by helping to build our deployment infrastructure processes, creating world-class SRE measurement, building automation tools that improve operational efficiency, and maintaining a high standard of service operability and reliability.

  • Design, build, and implement scalable cloud-based systems for PaaS/IaaS.

  • Work closely with other teams on new products or features/improvements of existing products.

  • Develop, maintain and improve cloud deployment of our software.

  • Participate in the triage & resolution of complex infra-related issues

  • Collaborate with developers, QA and Product teams to establish, refine and streamline our software release process and software observability to ensure service operability, reliability, and availability.

  • Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces

  • Develop, maintain and improve automation tools that can help improve efficiency of SRE operations

  • Practice balanced incident response and blameless postmortems

  • Be part of an on-call rotation to support production systems

What we need to see:

  • BS or MS in Computer Science or equivalent program from an accredited University/College.

  • 8+ years of hands-on software engineering or equivalent experience.

  • Demonstrate understanding of cloud design in the areas of virtualization and global infrastructure, distributed systems, and security.

  • Expertise in Kubernetes (K8s) & KubeVirt and building RESTful web services.

  • Understanding of building agentic AI solutions, preferably with Nvidia open-source AI solutions. Demonstrated working experience with SRE principles such as metrics emission for observability, and monitoring and alerting using logs, traces and metrics.

  • Hands-on experience working with Docker, containers, and Infrastructure as Code such as Terraform, deployment, and CI/CD.

  • Knowledge of working with CSPs, for example: AWS (Fargate, EC2, IAM, ECR, EKS, Route 53, etc.), Azure, etc.

Ways to stand out from the crowd:

  • Expertise in technologies such as StackStorm, OpenStack, Red Hat OpenShift, and AI DBs like Milvus.

  • A track record of solving complex problems with elegant solutions.

  • Prior experience with Go & Python, React.

  • Demonstrate delivery of complex projects in previous roles.

  • Demonstrated ability to develop frontend applications using concepts such as SSA and RBAC.


Come find your dream high-tech job with Expoint. With our platform you can easily search for Senior Site Reliability Engineer opportunities at Nvidia in India, Dehradun. Whether you are looking for a new challenge or want to work with a specific organization in a particular role, Expoint makes it easy to find your perfect job match. Connect with leading companies in your area today and advance your high-tech career! Sign up today and take the next step in your career journey with Expoint.