Job Responsibilities
- Implement systems that are highly available, scalable, and self-healing
- Design, manage, and maintain tools to automate operational processes
- Automate security controls, governance processes, and compliance validation
- Demonstrate site reliability principles and practices every day and champion the adoption of site reliability throughout the team
- Collaborate with others to create and implement observability and reliability designs for complex systems that are robust, stable, and do not incur additional toil or technical debt
- Work toward becoming an expert on the applications and platforms in your remit while understanding their interdependencies and limitations
- Evolve and debug critical components of applications and platforms
- Provide comprehensive and ongoing guidance, tools, and solutions to support the firm’s growth
- Define and deploy monitoring, metrics, and logging systems on AWS, and implement and enhance infrastructure automation via Infrastructure as Code (IaC) using Terraform
Required qualifications, capabilities, and skills
- Formal training or certification in software engineering concepts and 5+ years of applied experience
- Minimum 10 years of overall experience
- Proficient in scripting languages such as Python, Bash, or PowerShell
- Proficient with DevOps practices and CI/CD pipelines
- Experience with automation tools like Terraform or AWS CloudFormation
- Excellent problem-solving and troubleshooting skills
- Ability to tackle design and functionality problems independently with little to no oversight
- Practical cloud-native experience
Preferred qualifications, capabilities, and skills