Job responsibilities
- Lead and mentor a team of SRE/DevOps engineers, fostering a culture of collaboration, continuous improvement, and innovation.
- Provide technical guidance and support to team members, ensuring high standards of performance and professional development.
- Design, implement, and manage scalable, secure, and highly available cloud infrastructure and applications hosted on AWS.
- Utilize Infrastructure as Code (IaC) tools such as Terraform to automate the provisioning and management of infrastructure resources.
- Ensure the reliability and performance of Kubernetes clusters and associated services.
- Implement and maintain observability solutions to monitor system performance, detect anomalies, and ensure uptime.
- Utilize tools such as Prometheus, Grafana, Splunk/ELK, or similar to provide actionable insights and proactive issue resolution.
- Develop and maintain CI/CD pipelines to automate the build, test, and deployment processes.
- Implement automation and tools to streamline operations and reduce manual intervention.
- Collaborate with development, QA, and product teams to ensure seamless integration and delivery of new features and updates.
- Communicate effectively with stakeholders to provide updates on system status, incidents, and improvements.
- Lead incident response efforts, perform root cause analysis, and implement corrective actions to prevent recurrence.
- Troubleshoot complex infrastructure and application issues, ensuring timely resolution and minimal impact on operations.
- Contribute to a team culture of diversity, equity, inclusion, and respect.
Required qualifications, capabilities, and skills
- Formal training or certification in SRE/DevOps concepts and 5+ years of applied experience with a strong background in cloud infrastructure, automation, and CI/CD.
- Expertise in AWS services and architecture.
- Proficiency in Terraform for Infrastructure as Code (IaC).
- Extensive experience with Kubernetes for container orchestration.
- Strong knowledge of observability tools and practices.
- Excellent scripting and automation skills (e.g., Python, Bash, Go).
- Strong problem-solving skills and the ability to troubleshoot complex issues.
- Excellent communication and leadership skills.
- Advanced understanding of agile methodologies such as CI/CD, Application Resiliency, and Security.
- Demonstrated proficiency in software applications and technical processes within a technical discipline (e.g., cloud, artificial intelligence, machine learning, mobile, etc.)
- Practical cloud-native experience.
Preferred qualifications, capabilities, and skills
- AWS Certified Solutions Architect or similar certifications.
- Experience with other cloud platforms (e.g., GCP, Azure).
- Familiarity with security best practices and compliance standards.
- Experience with configuration management tools (e.g., Ansible, Chef, Puppet).