Expoint – all jobs in one place
Finding a high-tech job at the best companies has never been easier

Nvidia SRE Engineer - Air Platform Team 
United States, North Carolina, Durham 
939541468

Time type: Full time
Posted on: Posted 3 Days Ago

What you'll be doing:

  • Design, deploy, and manage IaaS platforms with a focus on high availability and performance.

  • Automate infrastructure operations using tools like Terraform, Ansible, and Python.

  • Improve operational efficiency by automating repetitive workflows.

  • Develop monitoring and observability tooling to detect and prevent outages using Prometheus, Grafana, ELK, etc.

  • Deploy and troubleshoot non-disruptive cloud operations with an emphasis on secure production infrastructure.

  • Manage deployment/upgrades for Operating Systems, Kubernetes (k8s) clusters, and other orchestration tools.

  • Provide day-to-day support for engineering activities with CI/CD tools like Git and Jenkins.

  • Implement and enforce best practices around infrastructure security, access control, and operational efficiency.

What we need to see:

  • BS degree in Computer Science, Software Engineering, or a related field (or equivalent experience).

  • 3–5+ years of experience in a Site Reliability, DevOps, or Systems Engineering role.

  • Strong automation and scripting skills in Ansible, Python, and Shell Scripting.

  • Experience in IaaS environments, including deploying, configuring, and administering Linux-based bare metal servers.

  • Deep experience in infrastructure engineering, focused on managing and monitoring a highly available production infrastructure.

  • Skilled in observability practices, using Prometheus, Grafana, ELK/EFK, and integrated alerting systems.

  • Solid grasp of Linux internals and core networking concepts including NAT, DNS, DHCP, routing, and firewall configuration with iptables or nftables.

  • Experience with modern deployment architecture for non-disruptive cloud operations, including blue-green and canary rollouts.

  • Proficiency in Kubernetes, Docker, QEMU, and Libvirt.

Ways to stand out from the crowd:

  • Hands-on expertise with AWS, including deploying complex, load-balanced, and highly available workloads.

  • Proficiency in debugging network issues in both infrastructure and SDN.

  • Experience with performance tuning and benchmarking across storage, compute, or networking.

  • Experience implementing robust metrics collection and alerting infrastructure.

  • Familiarity with compliance standards such as FedRAMP, HIPAA, and SOC 2.

You will also be eligible for equity and benefits.