Our team develops and supports the infrastructure layers spanning our cloud accounts, network/connectivity, workload management, observability, and storage services. We build tooling that automates operations so that the Lacework infrastructure and service can scale. To be successful, you will design, define, develop, deploy, and operate internal tooling, APIs, and frameworks that streamline our workflows and automate our infrastructure.
The Role:
- Automate as much as reasonable to significantly improve operational efficiency of the Lacework platform
- Design, build and improve our infrastructure to enhance service scalability, resiliency, and efficiency across the company.
- Identify mission-critical problems and solve them via automation, tooling, communication, and informed design.
- Build and improve monitoring and instrumentation to predict future scalability or failure risks and resolve them before they become customer-facing issues.
- Facilitate company-wide visibility into key metrics, SLAs, and milestones so that scale and resiliency are a part of every conversation.
- Develop best practices alongside engineering/operations teams to improve the scalability and reliability of internal processes.
- Participate in an on-call rotation.
Minimum Qualifications:
- 3 years of SRE experience with production systems (depending on level)
- Strong development and automation skills.
- Extensive experience with Infrastructure as Code (Terraform, etc.), as well as supporting tooling (Atlantis, ArgoCD, etc.)
- Extensive experience with Kubernetes and supporting tooling (Helm, operators, etc.)
- Extensive experience with a variety of cloud managed services and providers
  - AWS: EKS, EC2, S3, RDS, Secrets Manager, etc.
- Experience building production-quality cloud infrastructure that enables reliable and rapid deployment of microservices with effective monitoring and built-in high availability and/or fault tolerance.
- Strong passion for using automation to create simple, repeatable dev and ops patterns that ensure a stable, reliable experience for customers.
- Strong cross-team communication skills.
- Experience with the building blocks of large-scale systems including load balancing, distributed/cloud computing, containers, instrumentation, and monitoring.
- Knowledge of cloud networking, including VPC configuration and cross-cloud connectivity.
- Familiarity with one or more programming languages (Python, Golang, etc.).
Preferred Qualifications:
- Experience with monitoring and observability systems and tools (Prometheus, Grafana, New Relic, DataDog, etc.)
- Belief that everything should be "as code"
- Experience in Systems, Operations, or Full-Stack Development is a major bonus
- Experience with Java application servers and JVM configuration
Wage ranges are based on various factors including the labor market, job type, and job level. Exact salary offers will be determined by factors such as the candidate's subject knowledge, skill level, qualifications, experience, and geographic location.