Expoint – all jobs in one place

Palo Alto Manager Site Reliability Engineering Technical Incidents - Cortex 
India, Karnataka, Bengaluru 
360320747

26.08.2025

Being the cybersecurity partner of choice, protecting our digital way of life.

Your Career

We’re seeking an experienced Cloud SRE lead to drive high-severity incident and problem management across our GCP-centric platforms. This role combines deep technical troubleshooting with process ownership, ensuring rapid recovery, root cause elimination, and long-term reliability improvements. You will own L3 on-call responsibilities, drive post-incident learning, and champion automation and operational excellence.

Your Impact

  • In your technical and leadership capacity, you will contribute to seamless production site reliability operations, partnering closely with regional and global SRE counterparts, with special attention to the following:
  • Incident Analysis & Problem Management: Implement and lead post-mortem processes within SLAs, identify root causes, and drive corrective actions to reduce repeat incidents. Establish and maintain a problem backlog, ensuring timely resolution and continuous process improvement.
  • Troubleshooting: Rapidly diagnose and resolve failures across Kubernetes, Terraform, and GCP using advanced troubleshooting frameworks.
  • Preventative Measures: Implement automation and enhanced monitoring to proactively detect issues and reduce incident frequency.
  • Stakeholder Communication: Work with GCP/AWS TAMs and other vendors to request new features and follow up on updates.
  • Mentorship: Coach and elevate SRE and DevOps teams, promoting best practices in reliability and incident/problem management.
  • Documentation: Establish and maintain a problem backlog, ensuring timely resolution and continuous process improvement.
  • Envision the future of SRE with AI/ML: Envision how a modern SRE team should operate by leveraging AI/ML.

Your Experience

  • 12+ years of experience in SRE/DevOps/Infrastructure roles, with a strong foundation in cloud-based environments.
  • 5+ years of proven experience managing SRE/DevOps teams, preferably with a strong focus on Google Cloud Platform (GCP).
  • Deep hands-on knowledge of Terraform, Kubernetes (GKE), GitLab CI/CD, and modern observability practices (e.g., Prometheus, OpenTelemetry).
  • Strong experience in managing incident response and postmortems, reducing MTTR, and driving proactive reliability improvements.
  • Proficiency with cloud platforms such as GCP and AWS.
  • Solid grasp of Infrastructure as Code, container orchestration, and scalable cloud architectures.
  • Track record of building tools for system reliability, automated remediation, and performance tuning.
  • Experience leveraging AI/ML-based operations tools for automation, anomaly detection, and predictive alerting is a plus.
  • Expertise in SLI/SLO/SLA design and implementation, and driving operational maturity through data.
  • Strong interpersonal and leadership skills, with a demonstrated ability to coach, mentor, and inspire teams.
  • Effective communicator, capable of translating complex technical concepts to non-technical stakeholders.
  • Committed to inclusion, collaboration, and creating a culture where every voice is heard and respected.

All your information will be kept confidential according to EEO guidelines.