Description:
Responsibilities:
- Incident Management, Monitoring and Alerting: Drive incident response processes and troubleshoot complex issues to ensure timely resolution of outages. Establish monitoring, logging, and alerting best practices using tools such as Datadog and Site24x7.
- Tooling and Automation: Build essential tooling to improve system reliability and automate the remediation of issues.
- Be part of the on-call rotation (24x7, 365 days a year).
- SOP Documentation: Create and maintain documentation for infrastructure, processes, and incident management protocols.
- Understanding of Infrastructure as Code (IaC) tools such as Terraform and Ansible to automate the provisioning, configuration, and deployment processes.
- Attend all training programs, complete all tasks set by the supervisor, and assist other trainees wherever possible.
- Cloud Platform Expertise: Hands-on experience with AWS cloud services, including EC2, S3, VPC, RDS, EKS, ECS, CF, and more.
- CI/CD Pipelines: Fair understanding of CI/CD pipelines using tools like Jenkins.
- Monitoring and Alerting: Hands-on experience with monitoring and alerting tools such as ELK, Datadog, CloudWatch, and Grafana to proactively identify and resolve issues.
- Performance Tuning: Continuously optimize system performance, identify bottlenecks, and implement strategies to improve scalability and efficiency.
- Cost Optimization: Identify and implement strategies to reduce cloud costs while maintaining performance and reliability.
- Security Best Practices: Adhere to security best practices and implement measures to protect infrastructure and data from vulnerabilities and threats.
- Collaboration and Communication: Work effectively with cross-functional teams to understand business requirements and provide technical guidance.
Required Skills and Experience:
- 2-3 years of experience as a Site Reliability Engineer.
- Strong proficiency in AWS cloud services such as EC2, S3, VPC, RDS, EKS, ECS, CloudFormation, and more. An AWS certification is a plus.
- Good logical, analytical, and problem-solving skills.
- Strong communication skills and the ability to work in shifts (24x7).
- Strong scripting skills (Python, PowerShell, CDK, Shell scripting).
- Understanding of infrastructure as code tools (Terraform, Ansible) and AWX Tower for Ansible automation.
- Knowledge of containerization (Docker) and orchestration platforms (Kubernetes).
- Expertise in CI/CD pipelines and automation tools (Jenkins, GitHub).
- Exposure to monitoring and alerting tools (CloudWatch, Datadog, ELK, Grafana, Site24x7).
- Experience documenting SOPs and RCAs.
- Understanding of security best practices and compliance standards. A security certification is a plus.