As a Site Reliability Engineer (SRE) for our large, regionally distributed SaaS platform, you will be primarily responsible for improving the reliability and availability of our mission-critical cloud-based services.
Essential Duties and Responsibilities:
- Observability and Monitoring:
  - Create new dashboards and metrics, including SLI/SLO metrics, that provide comprehensive observability into the health and performance of development teams' applications.
  - Work with development teams to ensure proper monitoring is set up and enabled for their services.
  - Identify incremental improvements to the observability and monitoring solutions.
- Reliability Consulting and Automation:
  - Consult with development teams on SRE services and best practices to help them improve the reliability of their applications.
  - Create automation and tooling to reduce toil and manual intervention.
- Incident and Problem Management:
  - Assist other teams with data and performance analysis to identify the root causes of issues and recommend automation actions.
- Knowledge Sharing and Mentoring:
  - Review the work of other SREs and provide training and guidance to help them improve their skills.
  - Communicate effectively with both technical and non-technical peers and customers.
- Process and Documentation:
  - Follow established processes when performing work, and help create and document processes as necessary.
  - Document troubleshooting steps and results in appropriate locations for historical access.
  - Ensure compliance with policies, procedures, and standards.
  - Implement or coordinate remediation required by audits and assessments, and document it as necessary.
- Time Estimation:
  - Estimate the time required to complete activities and projects.
Have you got what it takes?
- 6+ years of programming/scripting experience with any of the following: Go, Python, .NET (C#), Node.js
- 6+ years of experience working within public or private cloud environments
- 6+ years of SRE/DevOps/Observability or related experience
- 6+ years of experience with AWS
- Experience with Agile, Jira, GitHub, monitoring, automation, dashboarding, AWS, Azure, and DevOps
Tech Manager
Individual Contributor