Expoint – all jobs in one place

Nvidia Senior Site Reliability Engineer 
United States, California 
574192314

16.09.2025
US, CA, Santa Clara
Time type: Full time
Posted on: 6 days ago

What you'll be doing:

  • Develop frameworks and scripts to automate workflows and deployments in the cloud environment.

  • Deploy and maintain a large farm of machines using the latest Configuration Management & Infrastructure Automation tools (Chef, Ansible, Terraform).

  • Develop extensive monitoring systems that provide a fast, reliable, real-time pulse of the various infrastructure subsystems (Zabbix, Grafana, Prometheus).

  • Participate in on-call & rotational L1 support for round-the-clock monitoring and remediation of the infrastructure.

  • Solve complex problems involving infrastructure scaling and capacity planning. Analyze and debug operating system, networking, configuration and performance problems.

  • Assist in the roll-out and deployment of new development features aimed at supporting the latest Nvidia hardware and technologies.

  • Develop SRE agents that help streamline daily Cost of Business activities, reduce toil and improve operational efficiency.

What we need to see:

  • Bachelor's or Master's Degree in Computer Science or Software Engineering, or equivalent experience.

  • Familiarity with implementing load balancing strategies, disaster recovery planning, business continuity best practices, and designing scalable, resilient systems based on SRE principles.

  • Ability to debug and analyze source code to triage, root-cause and resolve issues in the infrastructure, and to work closely with the development team on improving the build and test systems.

  • Hands-on coding experience with Python or Go, Unix shell proficiency, and knowledge of Java, C.

  • Experience with version control systems like Perforce and Git.

  • Demonstrable experience working in large scale enterprise production systems.

  • 8+ years of operational experience required.

Ways to stand out from the crowd:

  • Experience with public clouds (AWS, GCP, Azure), VM and container virtualization technologies like VMware, KVM, Docker and Kubernetes.

  • Background in automating bare metal and VM provisioning.

  • Experience with supporting GPUs, embedded device development, driver development and CUDA/TensorRT applications.

You will also be eligible for equity.