What you'll be doing:
Cultivate a top-performing team of Network Site Reliability Engineers by mentoring engineers and fostering a culture of collaboration, accountability, and technical excellence.
Manage the design, implementation, and maintenance of robust and scalable network infrastructure across data centers, cloud environments, and edge locations to ensure consistent connectivity and performance.
Apply proactive reliability engineering techniques to reduce network disruptions and decrease Mean Time to Recovery (MTTR), improving overall service reliability and user satisfaction.
Work closely with Security and Compliance teams to ensure that all network infrastructure meets regulatory standards and internal policies, maintaining a secure operational environment.
Lead initiatives to improve network observability by integrating advanced monitoring and alerting systems, and collaborate with cross-functional teams to implement network solutions that support business objectives and enhance user experiences.
What we need to see:
Bachelor’s or Master’s degree in Computer Science or a related field, or equivalent experience.
12+ years of overall proven experience in host and infrastructure networking.
6+ years in leadership roles managing teams focused on high-performance Software Defined Networking (SDN) solutions.
Strong understanding of networking protocols, with hands-on experience in kernel development and key technologies like routing, switching, load balancers, firewalls, VPNs, and cloud platforms such as AWS, GCP, and Azure.
Skilled in Infrastructure as Code (IaC) using automation tools like Ansible and Terraform, along with monitoring tools such as Prometheus, Grafana, and NetBox to improve network performance.
Proven ability to design network architectures for cloud and distributed systems, with practical experience in large-scale configurations and familiarity with SR-IOV, Xen virtualization, and Open vSwitch or similar SDN technologies.
Ways to stand out from the crowd:
Extensive experience in managing hybrid cloud environments and large-scale distributed systems, showcasing effective infrastructure management skills.
Strong understanding of Site Reliability Engineering (SRE) concepts, including SLAs, SLOs, and incident management best practices.
Proven ability to use operational signals like SNMP, Syslog, and Streaming Telemetry for efficient issue identification and resolution.
Comprehensive knowledge of Open vSwitch (OVS) and SR-IOV RDMA for effective network management and optimization.
Experience in debugging and improving code, automating repetitive tasks, and working with Mellanox/Cumulus Linux, Palo Alto firewalls, and NetScaler load balancers.
You will also be eligible for equity and .