- Respond to alerts, incidents, and outages with a focus on minimizing downtime and restoring services efficiently.
- Conduct thorough Root Cause Analysis (RCA) for critical issues and implement long-term solutions to prevent recurrence.
- Design, implement, and manage monitoring solutions to gain insights into system health and performance.
- Create and maintain intuitive dashboards that provide real-time visibility into critical metrics.
- Set up proactive alerting mechanisms to detect and resolve issues before they impact end users.
- Develop robust automation scripts using tools such as Terraform, Ansible, or CloudFormation to simplify infrastructure management.
- Automate repetitive operational tasks to improve system reliability and reduce manual effort.
- Develop comprehensive runbooks for effective incident response, troubleshooting, and system recovery.
- Maintain detailed documentation for infrastructure, processes, and best practices to support team knowledge sharing.
· Strong understanding of and operations.
· Hands-on experience in or practices.
· Proficiency in Kubernetes for container orchestration and cluster management.
· Proficiency in environments with expertise in shell scripting and understanding of Linux internals.
· Experience in writing code using or Go for automation and tooling.
· Practical knowledge of and Terraform for infrastructure automation.
· Solid experience with and feature branching strategies.
· Expertise in and to ensure system observability and performance.
Required Education
· Bachelor’s degree in Computer Science Engineering or Information Technology
Required Experience
· 1-2+ years