Lead and manage the Site Reliability Engineering team, providing guidance on escalated technical issues and complex infrastructure challenges
Oversee 24/7 monitoring and management of multi-cloud, hybrid-cloud, and on-premises infrastructure, ensuring optimal performance and availability
Develop and implement incident management procedures, incident response protocols, and escalation frameworks for infrastructure and application issues
Collaborate with cross-functional teams including DevOps, Security, and Application Development to resolve critical incidents and implement preventive measures
Manage vendor relationships and coordinate with external partners for infrastructure services and support
Drive continuous improvement initiatives to enhance system reliability, disaster recovery and business continuity planning (BCP), performance, and operational efficiency
Provide technical leadership during major incidents, coordinating response efforts and post-incident analysis
Develop team capabilities through mentoring, training, and knowledge sharing programs
Prepare and present operational metrics, incident reports, and improvement recommendations to senior leadership
Required Qualifications:
Bachelor’s degree in Computer Science, Information Technology, or a related field; Master’s preferred
8+ years of experience in infrastructure management, with at least 3 years in a leadership role
Extensive experience with multiple cloud platforms (AWS, Azure, GCP) and on-premises infrastructure management
Strong background in incident management, ITIL frameworks, and operational best practices
Experience with monitoring tools, automation platforms, and infrastructure-as-code technologies
Proven track record of leading technical teams in high-pressure, mission-critical environments
Excellent communication skills, with the ability to interact effectively with both technical teams and executive leadership
Experience with enterprise applications, databases, and network infrastructure
Strong analytical and problem-solving skills with attention to detail