You are responsible for leading and conducting resiliency design reviews, breaking complex problems into digestible work for other engineers, acting as a technical lead for medium to large-sized products, and providing advice and mentoring to other engineers.
Job responsibilities
- Participate in the follow-the-sun application support team, managing production technology incidents to resolution and ensuring timely engagement, escalation, and effective communication with business, technology, and vendor partners
- Act as a main point of contact for the application during major incidents and demonstrate the skills to identify and solve issues quickly to avoid financial losses
- Lead initiatives to improve the reliability and stability of the team’s applications and platforms using data-driven analytics to improve service levels
- Collaborate with SRE team members, developers, and business stakeholders to identify comprehensive service level indicators, then establish the corresponding service level objectives and error budgets
- Champion site reliability culture and practices and exert technical influence throughout your team
- Use Cloud experience to guide the wider SRE and developer teams as the platform transitions from on-premises processes to Cloud-based solutions
- Demonstrate a high level of technical expertise within one or more technical domains and proactively identify and solve technology-related bottlenecks in your areas of expertise
- Document and share knowledge within your organization via internal forums and communities of practice
Required qualifications, capabilities, and skills
- Formal training or certification in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices, with advanced practical experience
- Solid real-world production expertise in Cloud technologies, especially AWS
- Fluency in at least one programming language (ideally Python, but a similar language or framework is acceptable, e.g., Java Spring Boot, .NET)
- Deep knowledge of software applications and technical processes with emerging depth in one or more technical disciplines
- Proficiency and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform)
- Ability to identify and solve problems related to complex data and algorithms
- Drive to self-educate and evaluate new technology
- Ability to teach new programming languages to team members
- Ability to build relationships and collaborate across different levels and stakeholder groups
Preferred qualifications, capabilities, and skills
- Financial services industry expertise
- Experience with Python or similar scripting languages
- Experience with AWS or other Cloud technology stacks