About The Role
What You Will Do
● Design and implement visibility into our platform as we grow to multi-region scale.
● Design, deploy, and maintain cloud-native monitoring services in AWS and GCP that are elastic and resilient to failure.
● Provide standards and best practices for instrumentation of container-based services and cloud-managed services.
● Maintain our alerting pipeline so that we are notified of the right things, at the right time, in the right places.
● Drive automation wherever possible, enabling our monitoring platforms to scale effortlessly. Think self-service.
● Participate in, and help improve, our 24x7 incident response and on-call rotation.
What You Will Bring
● Strong Infrastructure as Code skills, ideally with Terraform and Kubernetes.
● Strong knowledge of modern logging tool sets, including Logstash or Fluentd.
● Understanding of Prometheus and its ecosystem, including Alertmanager.
● Good knowledge of Application Performance Monitoring tools and crash reporting tools, such as Sentry.
● Good knowledge of cloud provider managed services, and how they can be leveraged in our context.
● Ability to write high-quality code in Python, Go, or equivalent languages.