Essential Responsibilities:
- Manage and mentor a team of site reliability engineers, setting performance objectives, providing technical guidance, and ensuring alignment with business goals.
- Oversee the execution of reliability initiatives, ensuring critical systems maintain high availability, resilience, and performance at scale.
- Work with engineering, operations, and product teams to ensure seamless integration of reliability best practices into the development, deployment, and operational processes.
- Lead incident management activities, including coordination of response efforts, root cause analysis, and implementing solutions to prevent future incidents.
- Define and track key performance indicators (KPIs) related to system reliability, availability, and performance, reporting results to leadership regularly.
- Promote and drive automation within the site reliability engineering team, ensuring processes are streamlined and systems operate with minimal manual intervention.
- Manage capacity planning efforts, ensuring the scalability of systems and the ability to handle increasing traffic and resource demands effectively.
- Ensure the development and testing of disaster recovery plans and procedures, minimizing downtime in the event of a failure.
- Lead career development and mentorship efforts for team members, ensuring engineers have the tools and opportunities to grow their skills and advance their careers.
- Work closely with leadership to align site reliability engineering goals with broader organizational objectives, ensuring engineering efforts support business continuity and growth.
Expected Qualifications:
- Minimum of 12 years of relevant work experience and a Bachelor's degree or equivalent experience.
- Previous management experience
Key Responsibilities:
- Lead and manage the Site Reliability engineering team, providing guidance on escalated technical issues and complex infrastructure challenges
- Oversee 24/7 monitoring and management of multi-cloud, hybrid-cloud, and on-premises infrastructure, ensuring optimal performance and availability
- Develop and implement incident management procedures, incident response protocols, and escalation frameworks for infrastructure and application issues
- Collaborate with cross-functional teams including DevOps, Security, and Application Development to resolve critical incidents and implement preventive measures
- Manage vendor relationships and coordinate with external partners for infrastructure services and support
- Drive continuous improvement initiatives to enhance system reliability, disaster recovery and business continuity planning (BCP), performance, and operational efficiency
- Provide technical leadership during major incidents, coordinating response efforts and post-incident analysis
- Develop team capabilities through mentoring, training, and knowledge sharing programs
- Prepare and present operational metrics, incident reports, and improvement recommendations to senior leadership
- Knowledge of financial cybersecurity standards (ISO 27001, SOC 2) and incident detection and response frameworks
Required Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field; Master's preferred
- 12+ years of experience in infrastructure management, with at least 3 years in a leadership role
- Extensive experience with multiple cloud platforms (AWS, Azure, GCP) and on-premises infrastructure management
- Strong background in incident management, ITIL frameworks, and operational best practices
- Experience with monitoring tools, automation platforms, and infrastructure-as-code technologies
- Proven track record of leading technical teams in high-pressure, mission-critical environments
- Excellent communication skills, with the ability to interact effectively with technical teams and executive leadership
- Experience with enterprise applications, databases, and network infrastructure
- Strong analytical and problem-solving skills with attention to detail.
Preferred Qualifications:
- Relevant certifications (ITIL, cloud provider certifications, PMP)
- Deep knowledge of financial regulatory requirements
- Background in DevOps practices and CI/CD pipeline management.
- Experience with containerization technologies and orchestration platforms.
Our Benefits:
Any general requests for consideration of your skills, please