Job Responsibilities:
- Champion a culture of site reliability, exerting technical influence throughout your team and the organization.
- Lead initiatives to improve service levels using data-driven analytics, enhancing the reliability and stability of applications and platforms.
- Collaborate with team members to identify comprehensive service level indicators and work with stakeholders to establish service level objectives and error budgets.
- Demonstrate high-level expertise in AWS, distributed systems, and data warehouse domains, proactively resolving technology-related bottlenecks.
- Act as the primary point of contact during major incidents, showcasing the ability to quickly identify and resolve issues to prevent financial losses.
- Document and share knowledge within the organization through internal forums and communities of practice.
Required Qualifications, Capabilities, and Skills:
- Formal training or certification in site reliability engineering concepts with 5+ years of applied experience.
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices.
- Proficiency in at least one programming language, such as Python, Java, C, or .NET.
- Extensive knowledge of software applications and technical processes, with emerging expertise in one or more technical disciplines.
- Proficiency in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
- Experience with continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform).
- Experience with cloud computing using AWS (e.g., EC2, EMR, Athena, Glue, Redshift) and with containers and container orchestration (e.g., Docker, ECS, Kubernetes).
- Experience troubleshooting common networking technologies and issues.
Preferred Qualifications, Capabilities, and Skills:
- Ability to identify and solve problems related to complex data structures and algorithms.
- Self-motivated lifelong learner, eager to embrace and master emerging technologies.
- Ability to extend your influence and collaborate across different levels and stakeholder groups.