Finding a high-tech job at the best companies has never been easier

Senior Site Reliability Engineer jobs at Nvidia in India, Dehradun

Find your perfect match with Expoint! Search for job opportunities as a Senior Site Reliability Engineer in India, Dehradun and join the network of leading companies in the high-tech industry, such as Nvidia. Sign up now and find your dream job with Expoint!

3 jobs found

15.09.2025

Nvidia Senior Site Reliability Engineer DGX Cloud India, Uttarakhand, Dehradun

Description:

India, Remote

time type: Full time

posted on: Posted 15 Days Ago

job requisition id

What you’ll be doing:

Build, implement and support operational and reliability aspects of large-scale Kubernetes clusters with focus on performance at scale, real time monitoring, logging and alerting
Define SLOs/SLIs, monitor error budgets, and streamline reporting
Support services before they launch through system creation consulting, developing software tools, platforms and frameworks, capacity management, and launch reviews
Maintain services once they are live by measuring and monitoring availability, latency and overall system health
Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity
Lead triage and root-cause analysis of high-severity incidents
Practice balanced incident response and blameless postmortems
Participate in on-call rotation to support production services

What we need to see:

BS in Computer Science or related technical field, or equivalent experience
10+ years of experience operating production services
Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
Proficiency in at least one high-level programming language (e.g., Python, Go)
In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
Proficient knowledge of SRE principles, encompassing SLOs, SLIs, error budgets, and incident handling
Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, etc.

Ways to stand out from the crowd:

Operating GPU-accelerated clusters with KubeVirt in production
Applying generative-AI techniques to reduce operational toil
Automating incidents with Shoreline or StackStorm

More jobs that may interest you

Nvidia Senior Site Reliability Engineer DGX Cloud United States, California

Nvidia Senior Site Reliability Engineer - DGX Cloud United States, Texas

Nvidia Senior Site Reliability Engineer BCM - DGX Cloud United States, Texas

07.09.2025

Nvidia Network Infrastructure Engineer India, Uttarakhand, Dehradun

Description:

time type: Full time

posted on: Posted 5 Days Ago

job requisition id

What you will be doing

Engage in 24/7 global shift rotations to provide remote support for network repairs and changes while collaborating across teams and updating customers on status and ticket information.
Drive operational improvements in change management and daily operations by following procedures.
Manage and operate large scale IP network technologies and infrastructures.
Utilise your skills in Peering and Datacenter interconnect technologies: PNI, Transit, Exchange, Passive DWDM, Wave circuits.
Monitor and support the network health of on-premises and cloud infrastructures.
Collaborate and develop workflow enhancements while documenting best practices.

What we need to see

Deep knowledge and experience of TCP/IP, BGP, OSPF, MPLS, IS-IS, VxLAN, EVPN, QoS, GRE, IPsec, DNS, and MACsec.
Over 4 years of experience in network operations.
Skilled in network troubleshooting techniques and leveraging creative problem-solving abilities.
Strong track record of alert response within defined SLAs and Incident management.
Experience with one or more of the following CSP environments: AWS, Azure, GCP, OCI.
Familiarity with Arista, Fortinet and Juniper.
Hands-on experience with contributing to tooling and automation for provisioning, monitoring, and managing complex network infrastructures.
Bachelor’s degree in Computer Science, related technical field, or equivalent experience.
Excellent verbal and written communication skills.

Ways To Stand Out From The Crowd:

Working knowledge of Mellanox/Cumulus OS.
Working knowledge of Infiniband technology.
Skilled in Unix/Linux system administration, with the ability to write and understand Python/Shell scripts to enhance productivity in hyperscale environments.
Familiarity with leveraging tools such as Netbox/Nautobot, Prometheus, Grafana, Panoptes to monitor and manage a global network.
Passionate about innovating and investing in ground breaking technologies.

27.07.2025

Nvidia Senior Site Reliability Engineer India, Uttarakhand, Dehradun

Description:

India, Remote

time type: Full time

posted on: Posted 6 Days Ago

job requisition id

What you'll be doing:

You will play a crucial role in ensuring the success of the Omniverse on DGX Cloud platform by helping to build our deployment infrastructure processes, creating world-class SRE measurement and creating automation tools to improve efficiency of operations, and maintaining a high standard of perfection in service operability and reliability.

Design, build, and implement scalable cloud-based systems for PaaS/IaaS.
Work closely with other teams on new products or features/improvements of existing products.
Develop, maintain and improve cloud deployment of our software.
Participate in the triage & resolution of complex infra-related issues
Collaborate with developers, QA and Product teams to establish, refine and streamline our software release process and software observability to ensure service operability, reliability, and availability.
Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces
Develop, maintain and improve automation tools that can help improve efficiency of SRE operations
Practice balanced incident response and blameless postmortems
Be part of an on-call rotation to support production systems

What we need to see:

BS or MS in Computer Science or equivalent program from an accredited University/College.
8+ years of hands-on software engineering or equivalent experience.
Demonstrate understanding of cloud design in the areas of virtualization and global infrastructure, distributed systems, and security.
Expertise in Kubernetes (K8s) & KubeVirt and building RESTful web services.
Understanding of building agentic AI solutions, preferably with Nvidia open-source AI solutions.
Demonstrated working experience with SRE principles such as metrics emission for observability, monitoring, and alerting using logs, traces, and metrics.
Hands-on experience working with Docker, containers, and Infrastructure as Code (e.g., Terraform) with CI/CD deployment.
Exhibit knowledge in concepts of working with CSPs, for example: AWS (Fargate, EC2, IAM, ECR, EKS, Route53 etc...), Azure etc.

Ways to stand out from the crowd:

Expertise in technologies such as StackStorm, OpenStack, Red Hat OpenShift, and AI databases like Milvus.
A track record of solving complex problems with elegant solutions.
Prior experience with Go & Python, React.
Demonstrate delivery of complex projects in previous roles.
Showcase ability in developing Frontend application with concepts of SSA, RBAC

Nvidia Senior Site Reliability Engineer DGX Cloud

India, Uttarakhand, Dehradun

242085963

15.09.2025

Find your dream high-tech job with Expoint. With our platform you can easily search for Senior Site Reliability Engineer opportunities at Nvidia in India, Dehradun. Whether you are looking for a new challenge or want to work with a specific organization in a particular role, Expoint makes it easy to find the perfect job match for you. Connect with leading companies in your area today and advance your high-tech career! Sign up today and take the next step in your career journey with Expoint.