Expoint – all jobs in one place

Nvidia Senior SRE Engineer 
Israel, Tel Aviv District, Tel Aviv-Yafo 
235362901

Locations: Israel, Tel Aviv; Israel, Raanana
Time type: Full time
Posted: 3 days ago

What you'll be doing:

  • You will be part of the NVIDIA AIR team, which is building the SaaS/IaaS platform for digital twins of AI data centers.

  • You will be responsible specifically for the infrastructure and Site Reliability Engineering (SRE) requirements of AIR.

  • Focus on efficiency by automating repetitive workflows.

  • Working on a microservices-based architecture.

  • Deploying and troubleshooting non-disruptive cloud operations with an emphasis on secure production infrastructure.

  • Continuously evaluating existing systems and driving improvements.

  • Managing deployments and upgrades for operating systems, Kubernetes (k8s) clusters, and/or other orchestration tools.

  • Day-to-day support for engineering activities with CI/CD tools such as Git and Jenkins.

  • Multitasking efficiently across different tracks to address evolving priorities.

What we need to see:

  • BSc in Engineering, relevant certifications, or equivalent experience.

  • 5+ years of experience with complex microservices-based architectures.

  • Proven experience with the best practices and discipline of managing and monitoring a highly available and secure production infrastructure.

  • Experience with the latest observability tools, such as the Prometheus stack and Datadog.

  • Experience with modern deployment architectures for non-disruptive cloud operations, including blue-green and canary rollouts.

  • Highly skilled in Kubernetes and Docker.

  • Experience in IaaS environments: deploying, configuring, and administering Linux-based bare-metal servers.

  • Experience with relational databases (MySQL) and SQL.

  • Expert in AWS

Ways to stand out from the crowd:

  • Skills in Linux/Unix Administration

  • Experience with Prometheus/Grafana.

  • Experience with APM tools like Dynatrace, Datadog, AppDynamics, New Relic, etc.

  • Experience implementing robust metrics collection and alerting.