Expoint – all jobs in one place

Nvidia Senior Site Reliability Engineer 
Israel, North District 
802780476

Today
Location: Israel, Yokneam
Time type: Full time
Posted on: 6 days ago

What you'll be doing:

  • Own the solutions you build, collaborating with cross-functional teams to successfully implement them.

  • Collaborate with various teams in a fast-paced environment to ensure seamless project completion.

  • Continuously improve solution provisioning and management through automation.

  • Detect performance issues and recommend solutions to maintain world-class service quality.

  • Conduct capacity management and planning to meet ongoing operational needs.

  • Participate in incident reviews, assist in root cause identification, and write RCA reports.

  • Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment spanning AWS, GCP, and on-prem.

  • Participate in the team's on-call rotation.

What we need to see:

  • B.S. degree in Computer Science or a related technical field (or equivalent experience).

  • 10+ years of experience building and supporting critical services, and 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.

  • Proficiency in Kubernetes administration, modern CI/CD techniques and Infrastructure as Code (IaC).

  • Deep understanding of Linux operating systems and TCP/IP fundamentals.

  • Expertise with at least one major cloud service provider (AWS, GCP, or Azure).

  • Demonstrated proficiency with end-to-end SRE capabilities and observability.

  • Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.

  • Creative problem solver with excellent debugging skills and great communication and documentation abilities.

Ways to stand out from the crowd:

  • Linux certification from a well-known vendor such as Red Hat or Oracle.

  • Prior experience managing large-scale Kubernetes deployments in production.

  • Strong skills in modern container networking and storage architecture.

  • Well-known cloud certification(s).

  • Hands-on experience working with Slurm/LSF environments.