Nvidia Senior Manager Network Site Reliability - GeForce 
United States, California 
484717548

12.08.2025
US, CA, Santa Clara
Time type: Full time
Posted: 7 days ago

What you'll be doing:

  • Cultivate a top-performing team of Network Site Reliability Engineers by fostering a culture of collaboration, accountability, and technical excellence, and by offering mentorship.

  • Manage the design, implementation, and maintenance of robust and scalable network infrastructure across data centers, cloud environments, and edge locations to ensure consistent connectivity and performance.

  • Apply proactive reliability engineering techniques to reduce network disruptions and decrease Mean Time to Recovery (MTTR), improving overall service reliability and user satisfaction.

  • Work closely with Security and Compliance teams to ensure that all network infrastructure meets regulatory standards and internal policies, maintaining a secure operational environment.

  • Lead initiatives to improve network observability by integrating advanced monitoring and alerting systems, and collaborate with multi-functional teams to implement network solutions that support business objectives and enhance user experiences.

What we need to see:

  • Bachelor’s or Master’s degree in Computer Science or a related field, or equivalent experience.

  • 12+ years of overall proven experience in host and infrastructure networking.

  • 6+ years in leadership roles managing teams focused on high-performance Software Defined Networking (SDN) solutions.

  • Strong understanding of networking protocols, with hands-on experience in kernel development and key technologies like routing, switching, load balancers, firewalls, VPNs, and cloud platforms such as AWS, GCP, and Azure.

  • Skilled in Infrastructure as Code (IaC) using automation tools like Ansible and Terraform, along with monitoring tools such as Prometheus, Grafana, and NetBox to improve network performance.

  • Proven ability to design network architectures for cloud and distributed systems, with practical experience in large-scale configurations and familiarity with SR-IOV, Xen virtualization, and Open vSwitch or similar SDN technologies.

Ways to stand out from the crowd:

  • Extensive experience in managing hybrid cloud environments and large-scale distributed systems, showcasing effective infrastructure management skills.

  • Strong understanding of Site Reliability Engineering (SRE) concepts, including SLAs, SLOs, and incident management best practices.

  • Proven ability to use operational signals like SNMP, Syslog, and Streaming Telemetry for efficient issue identification and resolution.

  • Comprehensive knowledge of Open vSwitch (OVS) and SR-IOV RDMA for effective network management and optimization.

  • Experience in debugging and improving code, automating repetitive tasks, and working with Mellanox/Cumulus Linux, Palo Alto firewalls, and NetScaler load balancers.

You will also be eligible for equity and .