JPMorgan Lead Site Reliability Engineer
United Kingdom, England, London
837874373

23.11.2024

Job responsibilities

Creates high quality designs, roadmaps, and program charters that are delivered by you or the engineers under your guidance
Provides advice and mentoring to other engineers and acts as a key resource for technologists seeking advice on technical and business-related issues
Demonstrates site reliability principles and practices every day and champions the adoption of site reliability throughout your team
Collaborates with others to create and implement observability and reliability designs for complex systems that are robust, stable, and do not incur additional toil or technical debt
Utilizes infrastructure as code: uses Terraform and GitLab CI/CD for automation, containerizes our environments (Kubernetes, Helm charts), and leverages cloud technologies to meet our goals
Expertly manages, configures, and troubleshoots operating systems, storage (block and object), and networking (VPCs, proxies, and CDNs), and administers high-availability Cockroach, PostgreSQL, and Redis clusters
Implements monitoring and instrumentation: metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations
Evolves and debugs critical components of applications and platforms
Provides comprehensive and ongoing guidance, tools, and solutions to support the firm’s growth
Makes significant contributions to JPMorgan Chase’s site reliability community via internal forums, communities of practice, guilds, and conferences

Required qualifications, capabilities, and skills

Formal training or certification on site reliability principles and concepts, and advanced experience implementing site reliability within an application or platform
Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform
Proven public or private cloud experience (GCP is our priority)
Fluency in at least one programming language (e.g., Python, Java, Go)
Extensive Kubernetes operational experience (ideally including Istio, ArgoCD)
Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitHub, Terraform)
Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker)
Experience with troubleshooting common networking technologies and issues
Advanced knowledge and experience in observability, including white-box and black-box monitoring, service level objectives, alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
Advanced knowledge of software applications and technical processes with considerable depth in one or more technical disciplines
Ability to communicate data-based solutions with complex reporting and visualization methods

Preferred qualifications, capabilities, and skills