Expoint – all jobs in one place

JPMorgan Lead Site Reliability Engineer 
United States, New Jersey, Jersey City 
765662642

Today

Take on a critical role in shaping the future of a globally recognized firm, with a direct and significant impact in a space built for top performers in site reliability.

Job responsibilities

  • Advocate and embody site reliability principles, fostering a culture of excellence and technical influence within your team.
  • Leverage AI tools to enhance operational effectiveness and automate processes, ensuring high-quality customer service.
  • Spearhead projects aimed at enhancing the reliability and stability of applications and platforms.
  • Use data-driven analytics and AI technologies to automate detection, diagnosis, and resolution processes, elevate service levels, and drive continuous improvement.
  • Engage stakeholders to establish realistic service level objectives and error budgets, ensuring alignment with customer expectations.
  • Exhibit advanced technical proficiency in one or more domains, proactively addressing technology-related bottlenecks.
  • Employ AI-driven solutions to streamline processes and enhance operational efficiency.
  • Serve as the primary contact during major incidents, demonstrating the ability to swiftly identify and resolve issues to prevent financial losses.
  • Act as a culture carrier by documenting and disseminating knowledge through internal forums and communities of practice.
  • Mentor team members, guiding them in the strategic adoption of AI technologies to enhance operational effectiveness and customer service.

Required qualifications, capabilities, and skills

  • Formal training or certification in site reliability engineering concepts and 5+ years of applied experience.
  • Proven success in an SRE or senior DevOps role, with deep knowledge of service level indicators/objectives (SLIs/SLOs), incident management, postmortem analysis, and systems reliability.
  • Expertise with observability stacks (e.g., Datadog/Dynatrace, Prometheus, Grafana, Splunk, OpenTelemetry), including deep experience correlating telemetry across services and time.
  • Hands-on skills in coding (at least one high-level programming language), cloud platforms (AWS or GCP), container orchestration (Kubernetes), infrastructure as code (Terraform), and resilient CI/CD pipelines.
  • Active experience with, or deep curiosity about, applying AI to operations, such as LLM-based copilots, anomaly detection, automated runbooks, and autonomous agents.
  • A track record of delivering under pressure. You finish what you start, adapt to uncertainty, and thrive in high-accountability environments.
  • You deconstruct complexity, organize effectively, and drive clarity into ambiguous operational environments. Documentation and design are second nature.
  • Outstanding communication, empathy, and professionalism—especially during incidents. You recognize that great systems serve real people.

Preferred qualifications, capabilities, and skills

  • Experience with operational and compliance rigor in banking, fintech, or a similar industry.
  • Experience managing and optimizing various types of databases, including relational and NoSQL databases.
  • Experience with game days, chaos experiments, or failure-mode analysis to improve service robustness.
  • A background in mentoring engineers or leading technical knowledge-sharing, especially around AI and SRE best practices.
  • Ability to initiate and implement ideas to solve business problems.
  • Strong communicator with excellent problem-solving, critical thinking, and analytical reasoning skills, along with attention to detail and a passion for innovation.