Expoint - all jobs in one place

Nvidia Senior Site Reliability Engineer - GeForce 
United States, Texas 
596276308

31.07.2024

The person in this position will be responsible for service response and workflows, and will drive tool and service development to maintain and improve service SLOs. We partner with service owners to drive the reliability of the service. GFN is an exciting service in the fast-growing game streaming industry.

What you will be doing:

  • Build tools to improve SRE observability.

  • Take part in the Kubernetes migration, including VMI setup and problem solving.

  • Rapidly debug and triage incidents and user-reported issues.

  • Take ownership of automation, scripting, and tooling, for both new and existing scripts, to help the team achieve 100% automation of daily tasks.

  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews.

  • Be part of an on-call rotation to support production systems.

What we need to see:

  • MS or BS in Computer Science/Engineering or a related field or equivalent experience.

  • 8+ years of site reliability engineering experience working on large-scale distributed microservices in a production environment, with a real passion for automation and tooling.

  • Very strong Kubernetes background, including the ability to understand complex, highly available VMI setups on K8s.

  • Experience leading significant production improvements, including change management, post-mortem reviews, and workflow processes, and designing and delivering software automation in various languages.

  • Proven strengths in problem solving and root-causing issues, while continuously seeking ways to drive optimization, efficiency, and the bottom line.

Ways to stand out from the crowd:

  • Previous experience with Datadog, Prometheus, Alertmanager, or similar monitoring systems.

  • Jenkins (or similar CI/CD) setup, configuration, and deployment experience is a requirement.

  • Excellent communication, presentation, social, and analytical skills; the ability to communicate complex interaction concepts clearly and persuasively across different audiences and varying levels of the organization.

  • Experience with StackStorm, Prometheus, Kubernetes, and similar tools is a bonus.

  • Prior experience as an SRE or Service Engineering is a huge plus.

You will also be eligible for equity and .