Expoint - all jobs in one place

Nvidia Senior Site Reliability Engineer 
United States, California 
82563783

01.09.2024

The cloud hosts a heterogeneous mix of machines and devices with various operating systems (Windows/Linux/Android).


What you'll be doing:

  • Develop frameworks and scripts to automate workflows and deployments in a private cloud environment that houses several compute servers with NVIDIA GPUs.

  • Specific focus on building and stabilizing our virtualization infrastructure of ESXi, KVM and Hyper-V.

  • Deploy and maintain a large farm of machines using the latest Configuration Management & Infrastructure Automation tools (Chef, Ansible, Terraform).

  • Develop extensive monitoring systems that keep a fast, reliable, real-time pulse on the various infrastructure subsystems (Zabbix, BigPanda, Grafana).

  • Participate in on-call and rotational L1 support for round-the-clock monitoring and remediation of the infrastructure (PagerDuty).

  • Tackle sophisticated problems involving infrastructure scaling, capacity and planning.

  • Analyze and debug operating system, networking, configuration and performance problems.

  • Assist in the roll-out and deployment of new development features aimed at supporting the latest NVIDIA hardware and technologies.

What we need to see:

  • Bachelor's or Master's Degree in Computer Science or Software Engineering, or equivalent experience.

  • Proven experience working in large-scale enterprise production systems. 6+ years of professional experience required.

  • Ability to debug and analyze source code to triage, root-cause and resolve issues in the infrastructure. Work closely with the platform engineering team to understand hardware setups.

  • Familiarity with the setup and maintenance of Linux and Windows hosts.

  • Hands-on coding experience with Python or Go. Unix shell proficiency. Knowledge of Java and C.

  • Experience with version control systems like Perforce and Git.

Ways to stand out from the crowd:

  • Experience with VM and hardware virtualization technologies like VMware, KVM, Hyper-V, Docker and Kubernetes.

  • Background in automating bare metal and VM provisioning.

  • Experience with supporting GPUs, embedded device development, driver development and CUDA/TensorRT applications.

  • Development experience in Chef, Ansible and infrastructure orchestration.

You will also be eligible for equity and .