Being the cybersecurity partner of choice, protecting our digital way of life.
Your Career
We’re seeking an experienced Cloud SRE Lead to drive high-severity incident and problem management across our GCP-centric platforms. This role combines deep technical troubleshooting with process ownership, ensuring rapid recovery, root cause elimination, and long-term reliability improvements. You will own L3 on-call responsibilities, drive post-incident learning, and champion automation and operational excellence.
Your Impact
- In your technical and leadership capacity, you will contribute to seamless production site reliability operations, partnering closely with regional and global SRE counterparts, with special attention to the following:
- Incident Analysis & Problem Management: Implement and lead post-mortem processes within SLAs, identify root causes, and drive corrective actions to reduce repeat incidents. Establish and maintain a problem backlog, ensuring timely resolution and continuous process improvement.
- Troubleshooting: Rapidly diagnose and resolve failures across Kubernetes, Terraform, and GCP using advanced troubleshooting frameworks.
- Preventative Measures: Implement automation and enhanced monitoring to proactively detect issues and reduce incident frequency.
- Stakeholder Communication: Work with GCP/AWS TAMs and other vendors to request new features and follow up on updates.
- Mentorship: Coach and elevate SRE and DevOps teams, promoting best practices in reliability and incident/problem management.
- Documentation: Establish and maintain a problem backlog, ensuring timely resolution and continuous process improvement.
- Envision the future of SRE with AI/ML: Define how a modern SRE team should operate by leveraging AI/ML.
Your Experience
- 12+ years of experience in SRE/DevOps/Infrastructure roles, with a strong foundation in cloud-based environments.
- 5+ years of proven experience managing SRE/DevOps teams, preferably with a strong focus on Google Cloud Platform (GCP).
- Deep hands-on knowledge of Terraform, Kubernetes (GKE), GitLab CI/CD, and modern observability practices (e.g., Prometheus, OpenTelemetry).
- Strong experience in managing incident response and postmortems, reducing MTTR, and driving proactive reliability improvements.
- Proficiency with cloud platforms such as GCP and AWS.
- Solid grasp of Infrastructure as Code, container orchestration, and scalable cloud architectures.
- Track record of building tools for system reliability, automated remediation, and performance tuning.
- Experience leveraging AI/ML-based operations tools for automation, anomaly detection, and predictive alerting is a plus.
- Expertise in SLI/SLO/SLA design and implementation, and driving operational maturity through data.
- Strong interpersonal and leadership skills, with a demonstrated ability to coach, mentor, and inspire teams.
- Effective communicator, capable of translating complex technical concepts to non-technical stakeholders.
- Committed to inclusion, collaboration, and creating a culture where every voice is heard and respected.
All your information will be kept confidential according to EEO guidelines.