Expoint - all jobs in one place

Nvidia Senior Site Reliability Engineer 
India, Karnataka, Bengaluru 
600215021

31.07.2024

What you will be doing:

  • Design, implement and support large-scale Kubernetes clusters with monitoring, logging and alerting.

  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.

  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews.

  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.

  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.

  • Practice sustainable incident response and blameless postmortems.

  • Be part of an on-call rotation to support production systems.

What we need to see:

  • A minimum of 3 years of hands-on experience in the setup, administration and maintenance of multiple large (100+ nodes) Kubernetes clusters, both on-premises and on cloud service providers such as AWS, Azure, GCP and OCI.

  • Strong coding experience in one or more of the following languages: Go, Python, Perl, Java, C, C++, Ruby.

  • At least 2 years of hands-on system administration experience in large-scale UNIX production environments, with proven debugging and troubleshooting skills.

  • Ability to maintain platform SLAs by resolving issues accurately.

  • Outstanding teammate who can collaborate and influence in a multifaceted environment.

  • Demonstrable experience with algorithms, data structures, complexity analysis and software design.

  • BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics).

Ways to stand out from the crowd:

  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker.

  • Demonstrated ability to automate routine tasks, debug and optimize existing code.

  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

  • Hands-on experience in network and storage administration.

  • Unit testing and benchmarking are an integral part of your code.

  • Ability to reason about and choose the most suitable algorithm to meet scaling and availability challenges.

  • Ability to decompose complex requirements into simple tasks and reuse existing solutions to implement most of them.