Nvidia Senior Staff Reliability Engineer - Compute 
United States, Texas 
993149867

01.09.2024

What you will be doing:

  • Build automation workflows for self-service and auto-healing capabilities

  • Work with teams to deploy new data center infrastructures

  • Plan and implement optimizations for compute infrastructure consumption

  • Create alerting, reports, and dashboards for monitoring overall system health and hygiene

  • Collect and review system data for capacity planning, analyze it to develop plans for appropriately scaled enterprise-wide systems, and coordinate with management to implement changes

  • Collaborate with internal customers to solve complex problems

What we need to see:

  • 8+ years of experience in on-prem and public cloud platforms with a focus on automation of infrastructure configuration and management

  • BS degree or equivalent experience

  • Sound knowledge in DevOps methodologies, such as CI/CD and Agile

  • Experience with design and deployment of virtualization architectures, including VMware and KubeVirt platforms

  • Proficient with container orchestration solutions, such as Kubernetes and Docker

  • Strong proficiency in scripting languages, such as Golang or Python

  • Familiarity with configuration management platforms such as Terraform, SaltStack or Ansible

  • Ability to communicate complex concepts clearly and persuasively across different audiences and levels of the organization

Ways to stand out from the crowd:

  • Automated maintenance processes for a large enterprise fleet of over 25,000 virtual machines

  • Built highly resilient Infrastructure-as-Code (IaC) solutions providing auto-healing and full self-service capabilities

You will also be eligible for equity.