Expoint - all jobs in one place

Nvidia Senior Site Reliability Engineer - DGX Cloud 
United States, Texas 
483375309

01.12.2024

What you'll be doing:

  • Design, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting.

  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement.

  • Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews.

  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.

  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.

  • Practice sustainable incident response and blameless postmortems.

  • Be part of an on-call rotation to support production systems.

What we need to see:

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.

  • 5+ years of experience.

  • Experience with infrastructure automation and distributed systems design, and with designing and developing tools for running large-scale private or public cloud systems in production.

  • Experience in one or more of the following: Python, Go, Perl, or Ruby.

  • In-depth knowledge of Linux, networking, and containers.

Ways to stand out from the crowd:

  • Interest in crafting, analyzing and fixing large-scale distributed systems.

  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

  • Ability to debug and optimize code and automate routine tasks.

  • Experience using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.

You will also be eligible for equity.