Own the production infrastructure on AWS and Azure. Implement sustainable, scalable solutions that improve availability and performance
Help identify the root cause of every incident and prevent it from recurring
Alert on symptoms, not on outages. Ensure all infrastructure and application alerts are “actionable” and/or backed by self-healing automation
Work closely with R&D and Support, offering education and guidance on integration, support, and monitoring across the toolset
Everything-as-code approach: run our infrastructure with Ansible, Terraform, and Kubernetes
Document every action, turn it into a repeatable procedure, and then into automation
Focus on the system's observability, availability, reliability, performance/latency, and monitoring
Conduct periodic on-call duties and emergency response
Requirements:
At least 3 years of experience as a DevOps engineer or SRE in a SaaS environment
Experience with coding languages - Python / JavaScript / Bash, or similar
At least 3 years of experience with alerting and monitoring systems such as Datadog / Splunk / New Relic / Prometheus, or similar
Experience working with Linux systems from kernel to shell and beyond
Experience with cloud platforms such as AWS / Google Cloud / Azure
Experience with configuration management tools such as Ansible / Chef / Puppet
Experience with Docker, Kubernetes, and Helm
SCM - Git / Bitbucket / GitLab / Phabricator / Gerrit
Strong analytical and troubleshooting skills - ability to solve complex problems
Strong verbal and written communication skills and a collaborative mindset
Ability to dive into detail while understanding the big picture