Finding the best job has never been easier

Nvidia Senior Staff Reliability Engineer - Compute
United States, Texas
993149867

01.09.2024

What you will be doing:

Build automation workflows for self-service and auto-healing capabilities
Work with teams to deploy new data center infrastructures
Plan and implement optimizations for compute infrastructure consumption
Create alerting, reports, and dashboards for monitoring overall system health and hygiene
Collect and review system data for capacity and planning purposes, analyze capacity data and develop plans for appropriate level enterprise-wide systems, and coordinate with management personnel in implementing changes
Collaborate with internal customers to solve complex problems

What we need to see:

8+ years of experience in on-prem and public cloud platforms with a focus on automation of infrastructure configuration and management
BS degree or equivalent experience.
Sound knowledge in DevOps methodologies, such as CI/CD and Agile
Experience with design and deployment of virtualization architectures, including VMware and KubeVirt platforms.
Proficient with container orchestration solutions, such as Kubernetes and Docker
Strong proficiency in scripting languages, such as Golang or Python
Familiarity with configuration management platforms such as Terraform, SaltStack or Ansible
Ability to clearly communicate complex concepts clearly and persuasively across different audiences and varying levels of the organization

Ways to stand out from the crowd:

Automated maintenance processes for a large enterprise fleet of over 25,000 virtual machines
Built highly resilientInfrastructure-as-Code(IaC) solution providing auto-healing and fully self-service capable

You will also be eligible for equity and .

These jobs might be a good fit

Apple Senior Compute Site Reliability Engineer GPU United States, Washington, Seattle

Get to the top of the "yes list" with a standout CV!

CREATE CV