● Automate as much as is reasonable to significantly improve the operational efficiency of the Lacework platform.
● Design, build, and improve our infrastructure to enhance service scalability, resiliency, and efficiency across the company.
● Identify mission-critical problems and solve them via automation, tooling, communication, and informed design.
● Build and improve monitoring and instrumentation to predict future scalability or failure risks and solve them before they manifest into customer-facing issues.
● Facilitate company-wide visibility into key metrics, SLAs, and milestones so that scale and resiliency are a part of every conversation.
● Develop best practices alongside engineering/operations teams to improve the scalability and reliability of internal processes.
● Participate in an on-call rotation.
Minimum Qualifications:
● 3 years of SRE experience with production systems (depending on level)
● Strong development and automation skills.
● Extensive experience with Infrastructure as Code (Terraform, etc.), as well as supporting tooling (Atlantis, ArgoCD, etc.)
● Extensive experience with Kubernetes and supporting tooling (Helm, operators, etc.)
● Extensive experience with a variety of cloud managed services and providers
○ AWS: EKS, EC2, S3, RDS, Secrets Manager, etc.
● Experience building production-quality cloud infrastructure that enables reliable and rapid deployment of microservices with effective monitoring and built-in high availability and/or fault tolerance.
● Strong passion for using automation to create simple, repeatable dev and ops patterns that ensure a stable, reliable experience for customers.
● Strong cross-team communication skills.
● Experience with the building blocks of large-scale systems, including load balancing, distributed/cloud computing, containers, instrumentation, and monitoring.
● Knowledge of cloud networking, including VPC configuration and cross-cloud connectivity.
● Familiarity with one or more programming languages (Python, Golang, etc.).
Preferred Qualifications:
● Experience with monitoring and observability systems and tools (Prometheus, Grafana, New Relic, Datadog, etc.)
● Belief that everything should be "as code"
● Experience in Systems, Operations, or Full-Stack Development is a major bonus
● Experience with Java application servers and JVM configuration