Document, agree on and implement a consequence practice for when an error budget is breached
Drive our transformation to “everything as code”
Partner with R&D and Operations teams to enhance telemetry
Work with Architects to design for availability and performance
Work with engineering to provide intelligence on system performance for continuous improvements
Use operational intelligence from observability and telemetry to auto-remediate availability issues
Gather data and use AI to infuse predictive alerts and actions
Design, automate and run anti-fragility tests, and conduct fire drills to ensure resiliency
Work with stakeholders to provide production sizing guidelines for cost-based autoscaling
Create, manage and maintain SLIs and SLOs
Actively participate in and drive operations and rapid emergency response efforts, help with blameless postmortems, and follow up on engineering actions
Work with engineering leadership to support incremental and continuous deployments into production
Provide guidance to lower-level engineers
Develop and provide input into new operational standards and best practices
Lead and drive participation in process improvement, training and tool development
Minimum Requirements:
Bachelor’s degree in Computer Science, Engineering or other related technical field, or equivalent experience
5 years of experience with a high-level language, e.g. GoLang, Python, Java, C#
5+ years with build, source and editing tools, including make, vi and bash
Demonstrable experience automating build, testing, deployment, alerting, and any other similar work
3 years supporting a multi-region, multi-tenant, SaaS or PaaS environment
3 years of experience with AWS (preferably AMI, EC2, EBS, ELB, IAM, KMS, RDS, S3, SNS, VPC, Route 53, CloudWatch, Lambda)
3 years of experience with automated delivery tools, e.g. Harness, Jenkins, Azure DevOps
2 years of experience with Infrastructure as Code, e.g. Terraform, Helm
3 years of experience with Docker, Kubernetes, Helm, YAML
3 years of experience with Git, branching strategies and automated config management, e.g. GitHub, GitLab, Chef, Ansible
Excellent problem-solving skills
2 years of experience with observability frameworks (telemetry, log aggregation, APM, synthetic testing), e.g. Datadog, AppDynamics, Splunk
Successful completion of a background screening process including, but not limited to, employment verifications, criminal search, OFAC, SS Verification, as well as credit and drug screening, where applicable and in accordance with federal and local regulations
Preferred Requirements:
AWS Developer, DevOps, SysOps, or Solutions Architect certifications
Detail-oriented and highly organized with the ability to manage multiple priorities and parallel projects
Excellent written and verbal communication skills
Experience with identity solutions such as Auth0 or Keycloak
Experience with the HashiCorp suite, including Vault, Consul and Boundary
Demonstrated experience working in agile environments
Experience implementing and managing data processing platforms that meet regulatory compliance requirements such as PCI-DSS, HIPAA and GDPR
Excellent organization, time management, and project management skills
Previous success operating in a matrix environment
Have successfully led a DevOps or SRE transformation from a technical perspective