Job responsibilities
- Guides and assists others in building appropriate-level designs and gaining consensus from peers where appropriate
- Demonstrates and champions site reliability culture and practices and exerts technical influence throughout your team
- Leads initiatives to improve the reliability and stability of web hosting platforms, using data-driven analytics to raise service levels
- Collaborates with team members to identify comprehensive service level indicators, and with stakeholders to establish reasonable service level objectives and error budgets with customers
- Demonstrates a high level of technical expertise within one or more technical domains and proactively identifies and solves technology-related bottlenecks in your areas of expertise
- Collaborates with technical experts, key stakeholders, and team members to resolve complex problems
- Provides comprehensive and ongoing guidance, tools, and solutions to support the firm’s growth
- Works toward becoming an expert on the applications and platforms under your influence while understanding their interdependencies and limitations
- Documents and shares knowledge within your organization via internal forums and communities of practice
Required qualifications, capabilities, and skills
- Formal training or certification on site reliability engineering concepts and 3+ years of applied experience
- AWS exposure (understanding of and working experience with AWS applications, plus an understanding of resiliency, scalability, observability, monitoring, etc.)
- Experience in provisioning AWS infrastructure through Terraform
- Experience as an SRE on complex, mission-critical applications involving a multitude of components of varying technical generations
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform
- Advanced knowledge of site reliability culture and principles, with demonstrated ability to implement site reliability within an application or platform
- Advanced knowledge of and experience in observability, monitoring, alerting, and telemetry collection using tools such as CloudWatch, Grafana, Dynatrace, Prometheus, and Splunk
- Fluency in at least one programming or automation language (e.g., Python, Terraform, Ansible, Java Spring Boot, shell scripting, .NET)
- Strong communication skills, with the ability to mentor and educate others on site reliability principles and practices
- Deep knowledge of software applications and technical processes with emerging depth in one or more technical disciplines
- Drive to self-educate and evaluate new technology
Preferred qualifications, capabilities, and skills
- Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team
- Ability to initiate and implement ideas to solve business problems