Develop automated solutions to address potential problems before they result in a service interruption.
Provide impact assessments and mitigation plans for changes going into the production environment.
Investigate the root causes of severe and systemic outages, identify corrective actions, and apply them across the enterprise.
Develop availability measures that align with consumer experience to accurately assess the usability of crucial services.
Build capacity models that baseline transactional load against resource performance, use the data to predict overall system capacity, and automate load placement to avoid outages.
Identify thresholds for all critical links in the data path to quickly isolate where imbalances may result in potential outages.
Analyse failure points in services to model risk levels and define resolution steps if a failure occurs.
Assist in driving architecture enhancements into the system to mitigate potential failure points.
Programmatically monitor for and remediate configuration drift of critical devices.
Develop response plans for potential failure points and evaluate their effectiveness during planned tests.
Perform comprehensive operational health checks of all services to identify areas of concern and track activities to drive improvements at all levels of the architecture.
Provide technical coaching and direction to more junior teammates.
Qualifications/Essential Requirements
Bachelor's degree in Computer Science or a STEM major (Science, Technology, Engineering and Math) with at least 10 years of progressive experience.
Experience in site reliability engineering, with a focus on AWS.
Strong understanding of AWS services, architecture, and best practices.
Experience with configuring, customizing, and extending monitoring/APM tools (Datadog, Kloudfuse, Grafana, Splunk, etc.).
Operational experience in complex distributed systems, including defining, measuring, and monitoring SLOs/SLAs for availability and reliability goals.
Experience with incident management and post-incident reviews.
AWS Certified Solutions Architect Associate or AWS Certified DevOps Engineer certification is a plus.
Preferred Qualifications
Expertise in the management and administration of Kubernetes clusters.
Strong background in scripting, automation, configuration management, and infrastructure-as-code practices (Terraform, AWS CloudFormation, Crossplane, Pulumi, etc.).
Good understanding of DevOps practices, CI/CD pipelines, and version control systems (Git). Experience in GitOps is a plus.
Strong knowledge of Unix-based operating systems, workload management, and networking systems.