Your day-to-day:
- Handle production incidents and drive reductions in time to recovery.
- Act as a lead and drive changes that make incident management more efficient.
- Mentor and partner closely with cross-functional teams to enable incident handling for other brands.
- Act as a process owner for the Production Operations team and ensure consistent outcomes.
- Act as a product owner for the Platform teams and ensure automation is in place to reduce toil.
- Identify long-running problems and work with development teams to apply permanent solutions.
What you need to bring:
- Extensive experience leading or managing a team of site reliability engineers.
- Strong system and application triage skills, with mitigation expertise.
- Strong observability knowledge and hands-on experience managing dashboards & alerts using Terraform.
- Familiarity with one or more of the following: Node.js applications, Java, Python.
- Understanding of concepts related to a microservices architecture and system design.
- Exposure to AI concepts and machine learning models. Hands-on project experience is a must-have.
- Strong verbal and written communication skills.
Travel Percent:
The total compensation for this practice may include an annual performance bonus (or other incentive compensation, as applicable), equity, and medical, dental, vision, and other benefits. For more information, visit .
The U.S. national annual pay range for this role is $96,900 to $234,300.
Our Benefits:
Any general requests for consideration of your skills, please