Key Responsibilities:
Team Leadership and Management:
- Lead, mentor, and develop a team of SREs, fostering a culture of collaboration and continuous improvement
- Set clear goals and expectations for the team, ensuring alignment with business objectives
- Facilitate regular team meetings and one-on-one sessions to support individual growth and team cohesion
Execution and Delivery:
- Oversee the delivery of major themes of work, ensuring high-quality execution and timely completion
- Guide the team in estimating delivery timelines and managing workloads effectively
- Provide expert guidance in debugging and systems design, encouraging innovative solutions and trade-off analysis
Risk Management:
- Assess cross-impact of team deliverables and ensure proactive communication of potential risks
- Support the team in identifying technical limitations and suggesting remediation strategies
Strategic Vision and Forward Thinking:
- Develop and implement strategic plans for building robust systems with strong contracts, anticipating future changes
- Encourage the team to propose alternative requirements and solutions that better meet organizational needs
- Set and prioritize the team's strategic book of work in line with the goals of the business
Communication and Stakeholder Engagement:
- Communicate effectively with stakeholders, providing updates on progress and raising risks that will impact delivery
- Ensure the team is aligned with the business vision and understands the importance of their contributions to the product
Qualifications:
- Experience directly leading or functioning as a lead of technical teams, with a focus on SRE, DevOps, or infrastructure engineering
- Proficiency in programming languages (Python preferred) and distributed systems (Kubernetes, Kafka, Cassandra, etc.)
- Experience with setting up and using SLOs to track system health and performance
- Excellent problem-solving skills and creativity in debugging complex issues
- Deep understanding of cloud fundamentals and infrastructure management
- Exceptional communication skills, with the ability to articulate technical problems and solutions to diverse audiences
- A strategic mindset with a keen interest in automation and learning
- A thorough understanding of the full stack of the system
Our system has been working properly for the past few days in our UAT environment. We deployed a new version of core infrastructure that was tested in dev; we found it to be working & then approved it for UAT release. Suddenly, one of our services is not starting & our product or QA team cannot test changes in this environment. We receive a ping/bug report that provides high-level information about what is happening, what the user would like to happen & perhaps what they expect to happen. We ask you to take a look at the issue. Resolving this involves:
- Asking questions of & communicating with the user to fully understand what the issue is
- Understanding where in the stack to begin debugging
- Constantly questioning your assumptions about the way the system should work
- Being able to ask the right questions of your peers & team to triage an issue
- Providing updates to stakeholders that are counting on you to identify or fix the problem
- Using your technical skill set to identify/reproduce the issue
- Communicating what you have found to the team so that we can best resolve the issue