On a typical day in this role, you will interact with Kubernetes, Docker, Helm, Elasticsearch, Datadog, Grafana, Sensu, Puppet, Ansible/AWX, AWS, Azure, Python/Bash/PowerShell, and Terraform/Terragrunt. If you don’t know all of these tools, don’t worry. We do not expect you to know them all, and we understand that technology evolves quickly.
Major Responsibilities:
- Scale systems sustainably through mechanisms like automation.
- Take ownership of the monitoring system.
- Maintain services in production by measuring and monitoring availability, latency, and overall system health.
- Expand applications and scale them horizontally.
- Work closely with developer, support, and QA teams to maintain and improve the whole lifecycle of services.
- Practice sustainable incident response and blameless post-mortems.
- Provide primary operational support and engineering for multiple large distributed software applications.
Required Technical and Professional Expertise
- Familiarity with Site Reliability Engineering (SRE)
- Ability to thrive in autonomy
- Knowledge of configuration management tools (e.g., Ansible or Puppet)
- Experience with any scripting language (Bash, Python, PowerShell, etc.)
- Experience with containerization (e.g., Docker, Podman)
- Experience with container orchestration tools (e.g., Kubernetes, OpenShift, Docker Swarm)
- Experience with database administration and management (MS SQL Server, PostgreSQL, MongoDB)
- Familiarity with public cloud providers such as AWS, Azure, or IBM Cloud
- Experience with monitoring, observability, and logging (e.g., Datadog, Prometheus, Grafana, ELK stack, Loki)
- Familiarity with RESTful systems and their APIs
- Experience with high-level programming languages (Go, .NET, Java, etc.) is a plus
- Mentoring peers and sharing skills
Preferred Technical and Professional Expertise
- Ability to thrive in autonomy
- Experience in a large-scale, distributed Linux/Unix or Windows environment is a plus
- Mentoring peers and sharing skills
- Great communication skills