The application window has been extended and is expected to close by 06/20/2025. However, the job posting may be removed earlier if the position is filled or if a sufficient number of applications is received.
Note: The successful applicant will be performing work in U.S. Government classified environments and therefore must be a U.S. Person (i.e., U.S. citizen, U.S. national, lawful permanent resident, asylee, or refugee). This position may also perform work that the U.S. Government has specified can only be performed by a U.S. citizen on U.S. soil.
Your Impact
As a Site Reliability Engineer, you'll operate our next-generation Federal Cloud Security suite products, which run in an AWS GovCloud-native environment. This charter enables Umbrella to expand its access to the U.S. Government and other public sector market opportunities. Our current focus is on the platform's FedRAMP (Federal Risk and Authorization Management Program) authorization.
You will own the production environment from top to bottom, from deployments through observability and incident response. You'll work with customer support, our partners, and the development teams to keep our services running smoothly. This includes responding to alerts, running and improving our playbooks, and streamlining our processes to improve efficiency.
As an engineer on Umbrella's FedRAMP SRE team, your success in this role depends heavily on your ability to learn quickly.
Your typical day will vary, but this is what you will be doing:
- Developing and maintaining complex infrastructure
- Building platforms that provide key services to internal and external collaborators
- Designing solutions for scalability and reliability
- Ensuring compliance with security controls
- Building end-to-end documentation and instrumentation of our platform to ensure visibility, automation, self-healing, and resiliency throughout the stack
- Triaging, solving, and addressing production problems in every layer of the stack
- Collaborating with other engineers on the team to cultivate sound engineering principles and represent our engineering values
Minimum Qualifications:
- 4+ years of administrative experience with enterprise-grade AWS infrastructure
- Experience with incident response and release management
- Operational experience with a focus on continuous delivery and deployment (CI/CD) and cloud automation
- 2+ years of Linux systems administration experience, including troubleshooting
- Kubernetes experience, EKS preferred
Preferred Qualifications:
- Experience building, maintaining, troubleshooting, and monitoring Kubernetes clusters; experience with Helm charts is highly desired
- Experience with and/or knowledge of the Grafana suite
- Experience with Infrastructure-as-Code tools such as Terraform, Ansible, and CloudFormation
- Strong Jenkins experience, including troubleshooting
- Experience debugging applications and infrastructure in a cloud environment
- Experience working with end users and technical support on customer concerns, and collaborating with development teams to resolve issues efficiently
- Strong security and networking skills, including work with tools such as Wireshark, tcpdump, and Nmap
- Scripting skills in Go, Python, and Bash
- Ability to participate in a 24/7 on-call rotation