About the Role
- - - - What the Candidate Will Do ----
- Incident Management & Response: Lead cloud incident management efforts, ensuring rapid detection, triage, and resolution across all cloud platforms.
- Root Cause Analysis & SLA Compliance: Evolve key processes to ensure cloud incident RCAs are completed within the agreed Service Level Agreements (SLAs), track all action items, and drive continuous improvement in cloud reliability.
- Monitoring & Automation: Unify automated monitoring, alerting mechanisms, and centralized incident logging to improve detection and response times.
- Reporting & Insights: Develop targeted reports that provide directly relevant insights into cloud reliability.
- Continuous Improvement: Identify patterns in incidents, optimize response playbooks, and enhance incident management frameworks for ongoing operational resilience.
- - - - Basic Qualifications ----
- 5+ years of experience in cloud incident management, SRE, or operations.
- Expertise in multi-cloud environments.
- Experience with incident detection, response, and RCA processes.
- Strong analytical and problem-solving skills, with the ability to work under pressure.
- Excellent communication and stakeholder management skills.
- - - - Preferred Qualifications ----
- Certifications in cloud platforms.
- Hands-on experience with incident escalation procedures and service recovery plans.
- Experience with automated logging and forensic analysis tools.
- Familiarity with SLAs, compliance, and audit processes.
- Prior experience working in a highly scalable global organization.
* Accommodations may be available based on religious and/or medical conditions, or as required by applicable law. To request an accommodation, please reach out to .