Expoint - all jobs in one place

JPMorgan Lead Site Reliability Engineer 
United Kingdom, England, London 
837874373

Yesterday

Job responsibilities

  • Creates high-quality designs, roadmaps, and program charters that are delivered by you or the engineers under your guidance
  • Provides advice and mentoring to other engineers and acts as a key resource for technologists seeking advice on technical and business-related issues
  • Demonstrates site reliability principles and practices every day and champions the adoption of site reliability throughout your team
  • Collaborates with others to create and implement observability and reliability designs for complex systems that are robust, stable, and do not incur additional toil or technical debt
  • Utilizes infrastructure as code: uses Terraform and GitLab CI/CD for automation, containerizes our environments (Kubernetes, Helm charts), and leverages cloud technologies to meet our goals
  • Expertly manages, configures, and troubleshoots operating system issues, storage (block and object), and networking (VPCs, proxies, and CDNs), and administers high-availability CockroachDB, PostgreSQL, and Redis clusters
  • Implements monitoring and instrumentation: metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations
  • Evolves and debugs critical components of applications and platforms
  • Provides comprehensive and ongoing guidance, tools, and solutions to support the firm’s growth
  • Makes significant contributions to JPMorgan Chase’s site reliability community via internal forums, communities of practice, guilds, and conferences

Required qualifications, capabilities, and skills

  • Formal training or certification on site reliability principles and concepts, and advanced experience implementing site reliability within an application or platform
  • Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform
  • Proven public or private cloud experience (GCP is our priority)
  • Fluency in at least one programming language (e.g., Python, Java, Go)
  • Extensive Kubernetes operational experience (ideally including Istio, ArgoCD)
  • Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitHub, Terraform)
  • Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker)
  • Experience with troubleshooting common networking technologies and issues
  • Advanced knowledge and experience in observability, including white-box and black-box monitoring, service level objectives, alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
  • Advanced knowledge of software applications and technical processes with considerable depth in one or more technical disciplines
  • Ability to communicate data-based solutions with complex reporting and visualization methods

Preferred qualifications, capabilities, and skills

  • Recognized as an active contributor to the engineering community