Team Leadership: Manage and mentor a team of Automation SREs, fostering a culture of collaboration, innovation, and excellence in execution
Technical Guidance: Own technical decisions for the team, ensuring alignment with developers and applying industry-standard practices
Operational Excellence: Implement and maintain robust operational practices, including incident management, monitoring, alerting, and capacity planning
Shift Scheduling: Coordinate follow-the-sun support across global time zones, ensuring 24/7 coverage and efficient handovers
Project Management: Lead initiatives related to the design, deployment, and maintenance of critical infrastructure components
Release Management: Oversee release processes and ensure smooth deployments, minimizing downtime and impact on users
Root Cause Analysis: Conduct thorough post-incident reviews, identifying root causes and implementing preventive measures
What we need to see:
8+ years of industry experience focused on Site Reliability Engineering, with a strong background at cloud service providers, ISPs, or similar service-oriented networking companies
Technical Skills: Proficiency in managing distributed web infrastructures, designing scalable and resilient systems, and implementing network automation
Leadership: 2+ years of management experience, with a proven track record of managing technical teams, including performance management, career development, and hiring
Problem Solving: Demonstrated ability to conduct detailed root cause analysis and drive improvements based on findings
Communication: Excellent verbal and written communication skills, with experience presenting technical information to diverse audiences
Education: Bachelor’s degree in Computer Science, Engineering, or a related technical field, or relevant industry experience