Incident Management, Monitoring and Alerting: Drive incident response processes and troubleshoot complex issues, ensuring timely resolution of outages. Establish monitoring, logging, and alerting best practices using tools like Datadog, Site24x7, etc.
Tooling and Automation: Build essential tooling to improve system reliability and automate remediation of issues.
Be a part of the on-call rotation (24x7x365).
SOP Documentation: Create and maintain documentation for infrastructure, processes, and incident management protocols.
Understanding of Infrastructure as Code (IaC) tools such as Terraform and Ansible to automate the provisioning, configuration, and deployment processes.
Attend all training programs, complete all tasks set by the supervisor, and assist other trainees wherever possible.
Cloud Platform Expertise: Hands-on experience with AWS cloud services, including EC2, S3, VPC, RDS, EKS, ECS, CloudFormation (CF) and more.
CI/CD Pipelines: Fair understanding of CI/CD pipelines using tools like Jenkins.
Monitoring and Alerting: Hands-on experience with monitoring and alerting tools like ELK, Datadog, CloudWatch, Grafana, etc., to proactively identify and resolve issues.
Performance Tuning: Continuously optimize system performance, identify bottlenecks, and implement strategies to improve scalability and efficiency.
Cost Optimization: Identify and implement strategies to reduce cloud costs while maintaining performance and reliability.
Security Best Practices: Adhere to security best practices and implement measures to protect infrastructure and data from vulnerabilities and threats.
Collaboration and Communication: Work effectively with cross-functional teams to understand business requirements and provide technical guidance.
Required Skills and Experience:
2-3 years of experience as a Site Reliability Engineer.
Strong proficiency in AWS cloud services like EC2, S3, VPC, RDS, EKS, ECS, CloudFormation and more. An AWS certification is a plus.
Good logical, analytical, and problem-solving skills.
Strong communication skills and the ability to work in shifts (24x7).