Job Responsibilities
- Guide and assist in designing and deploying new cloud services, building consensus among peers.
- Design and implement automated CI/CD pipelines for software teams.
- Collaborate with software engineers to develop solutions for application availability and scalability.
- Write and deploy infrastructure as code for supported applications and platforms.
- Work with technical experts and stakeholders to resolve complex technical problems.
- Use service level objectives (SLOs) to detect and resolve issues before they affect customers.
- Support the adoption of site reliability engineering best practices within the team.
Required Qualifications, Capabilities, and Skills
- Formal training or certification in site reliability engineering concepts and 5+ years of applied experience.
- Understanding of site reliability culture and principles, and of how to implement them within applications or platforms.
- Domain knowledge of software applications and technical processes in the AWS ecosystem.
- Experience with infrastructure as code tools like Terraform and CloudFormation.
- Experience with containers and container orchestration, such as Docker, ECS, and Kubernetes.
- Knowledge of CI/CD tools like Jenkins, GitLab, or GitHub Actions.
- Proficiency in Python, Java/Spring, or Ruby.
- Hands-on knowledge of Linux and networking internals.
Preferred Qualifications, Capabilities, and Skills
- Familiarity with observability concepts and tools like Datadog, OpenTelemetry, Grafana, Prometheus, and Splunk.
- Understanding of message bus platforms such as Kafka or Kinesis.
- Ability to troubleshoot networking technologies and issues.
- Proactive approach to recognizing roadblocks and learning new technologies.
- Ability to identify new technologies and solutions to meet design constraints.