Design and develop the DevOps infrastructure for business-critical systems.
Maintain and improve container-based Kubernetes environments.
Develop and integrate observability solutions across the stack (infrastructure, application, network, and user experience) that monitor each layer and provide actionable insights.
Work with developers and engineers to ensure that all relevant services, applications, and infrastructure components are instrumented according to current observability best practices (e.g., logging, tracing, and metrics collection).
Set up automated alerting systems for real-time detection of performance bottlenecks, failures, or anomalies, and integrate with incident management workflows.
Build pipelines for data collection, storage, and visualisation that help teams gain insights from monitoring data.
Use observability data to improve system reliability, availability, and performance by driving root cause analysis and continuous improvement initiatives.
Implement automated solutions for monitoring and alerting that scale with platform growth and reduce manual intervention.
Develop and maintain comprehensive documentation on monitoring, alerting, and incident response processes. Provide training and support to engineering teams to use observability tools effectively.