Job Category
Software Engineering
Job Details
Key Responsibilities
- System Reliability & Uptime: Maintain and improve service reliability, availability, and performance across distributed systems and applications.
- Monitoring & Alerting: Design, build, and maintain comprehensive monitoring, logging, and alerting systems to detect and address issues proactively.
- Incident Management: Respond to production incidents, perform root cause analysis, and implement preventative measures.
- Automation & Tooling: Automate repetitive tasks using scripts, configuration management, and infrastructure-as-code (IaC) tools to improve efficiency and consistency.
- Capacity Planning: Monitor usage trends, forecast growth, and scale systems to meet future demands while controlling costs.
- Deployment & CI/CD: Maintain and improve continuous integration and continuous delivery pipelines; ensure safe and frequent deployments.
- Security & Compliance: Collaborate with security teams to ensure systems adhere to best practices and compliance requirements, and participate in security assessments for onboarding and maintaining FedRAMP services.
- Collaboration: Work closely with development teams to design resilient and scalable systems; participate in architectural decisions.
- Documentation: Create and maintain clear, detailed documentation for runbooks, systems, and processes.
Requirements
- 8+ years of experience in an SRE role or related field (DevOps, Production Operations, etc.)
- Experience in Public Cloud environments, specifically with AWS
- Experience with New Relic, collectd, Splunk, Sumo Logic, Grafana, Terraform, Jenkins, Kubernetes, Spinnaker or related tools
- Excellent knowledge of Internet technologies and protocols (TCP/IP, DNS, HTTP, SSL, etc.)
- Strong experience with API fundamentals (SOAP, REST, RAML or OAS)
- Ability to identify the root causes of instability in high-traffic, large-scale distributed systems
- Solid knowledge of large-scale complex systems from a reliability perspective
- Passion for resolving reliability issues and identifying strategies to mitigate repeat issues.
- Experience with development in Python, Go, Bash, or a related language.
- Experience with FedRAMP environments.
- A related technical degree is required.
This candidate must be a U.S. citizen (U.S. born or naturalized) operating on U.S. soil who does not hold dual citizenship and who is able to meet customer and government screening standards applicable to this role, including a Criminal Justice Information Services screening with fingerprint scan. Due to the citizenship requirements for this role, which supports U.S. federal, state, and/or local government customers, citizenship will be verified through two of the following REAL ID Act documents: U.S. Passport, Passport Card, REAL Driver’s License, Global Entry Card, U.S. Government CAC/PIV. You agree to complete a Minimum Background Investigation (MBI) for a Moderate Public Trust position with the U.S. federal government and to gain other clearances as deemed appropriate for the role.
If you require assistance applying for open positions due to a disability, please submit a request via this.
Posting Statement