Key job responsibilities
- Implement observability and incident management solutions, including the use of generative AI to assist developers in diagnosis and remediation
- Establish and refine processes for effective incident management, including on-call rotations, escalation paths, and post-incident review
- Drive initiatives to improve the overall resilience and fault-tolerance of the Prime Video platform
- Partner closely with other engineering leaders to ensure availability and reliability goals are met
A day in the life
1. Team Management:
- Hold 1-on-1 meetings with direct reports to discuss progress, challenges, and development goals
2. Observability Platform Oversight:
- Review performance metrics and identify areas for improvement in the observability platform
- Oversee the roadmap and backlog for new observability features and capabilities
3. Incident Management:
- Oversee the incident management process, including establishing escalation paths and post-incident review
- Analyze incident data to identify recurring issues and drive long-term reliability improvements
4. Resiliency Program:
- Monitor key resiliency metrics and evaluate the effectiveness of resiliency efforts
5. Stakeholder Engagement:
6. Talent Management:
- Develop and retain top talent through career development plans and performance management
- Experience designing and developing large-scale, high-traffic applications