Job responsibilities
- Design, implement, and manage scalable, reliable, and secure cloud infrastructure on AWS, including deploying and scaling containerized applications using Kubernetes (EKS) and ECS.
- Develop and maintain infrastructure as code using Terraform to automate provisioning and configuration management, ensuring efficient and consistent deployments.
- Monitor system performance, optimize EKS workloads, and implement solutions to improve reliability and performance, including autoscaling and disaster recovery strategies.
- Implement logging and tracing using tools like ELK Stack, Splunk, Dynatrace, and AWS CloudWatch to ensure comprehensive monitoring and alerting.
- Integrate security tools such as SonarQube, Snyk, Trivy, and Aqua Security into CI/CD pipelines using Jenkins or AWS CodePipeline, and define automated rollback policies in Spinnaker.
- Collaborate with development teams to ensure smooth deployment and operation of applications, implementing and managing CI/CD pipelines to streamline the software development lifecycle.
- Troubleshoot and resolve infrastructure-related issues promptly, while continuously evaluating and implementing modern technologies and tools to improve operational efficiency.
Required qualifications, capabilities, and skills
- Formal training or certification in site reliability concepts and 3+ years of applied experience.
- Strong expertise in AWS services, including EC2, S3, RDS, VPC, IAM, and networking, with hands-on experience in Kubernetes (EKS) and ECS for container orchestration.
- Proficiency in using Terraform for infrastructure as code and a solid understanding of CI/CD concepts and tools such as Jenkins, GitLab CI, CircleCI, AWS CodePipeline, and Spinnaker.
- Experience with monitoring and logging tools like Prometheus, Grafana, ELK Stack, and CloudWatch, and with observability practices including white-box and black-box monitoring and telemetry collection.
- Strong scripting skills in languages such as Python, Bash, or Go, and proficiency in at least one programming language such as Python, Java/Spring Boot, or .NET.
- Excellent problem-solving skills, attention to detail, and the ability to troubleshoot common networking technologies and issues.
- Strong communication and collaboration skills, with the ability to contribute to large and collaborative teams by presenting information logically and compellingly.
- Proficiency in site reliability culture and principles, and familiarity with implementing site reliability within an application or platform.
- Proficiency in software applications and technical processes within a given technical discipline, such as cloud or artificial intelligence.
- Experience with continuous integration, continuous delivery, and infrastructure-as-code tools like Jenkins, GitLab, or Terraform, and familiarity with containers and container orchestration using Docker, ECS, and Kubernetes.
- Ability to proactively recognize roadblocks, demonstrate interest in learning technology that facilitates innovation, and initiate and implement ideas to solve business problems.
Preferred qualifications, capabilities, and skills
- Possession of AWS Certified Solutions Architect or DevOps Engineer certification, demonstrating advanced expertise in AWS services.
- Strong problem-solving skills with the ability to troubleshoot complex CI/CD issues effectively and efficiently.
- Proficiency in Java or in scripting with Bash or other shells, Node.js, and Python, showcasing versatility in programming.
- Experience with microservices architecture and serverless computing, enabling scalable and efficient application development.
- Familiarity with security best practices in cloud environments, ensuring robust and secure infrastructure management.
- Demonstrated leadership skills and experience in mentoring engineers, fostering a collaborative and growth-oriented team environment.