
JPMorgan ALTS - Lead SRE 
United States, New Jersey, Jersey City 
79258521

29.05.2025

Key Responsibilities:

Team Leadership and Management:

  • Lead, mentor, and develop a team of SREs, fostering a culture of collaboration and continuous improvement
  • Set clear goals and expectations for the team, ensuring alignment with business objectives
  • Facilitate regular team meetings and one-on-one sessions to support individual growth and team cohesion

Execution and Delivery:

  • Oversee the delivery of major themes of work, ensuring high-quality execution and timely completion
  • Guide the team in estimating delivery timelines and managing workloads effectively
  • Provide expert guidance in debugging and systems design, encouraging innovative solutions and trade-off analysis

Risk Management:

  • Assess cross-impact of team deliverables and ensure proactive communication of potential risks
  • Support the team in identifying technical limitations and suggesting remediation strategies

Strategic Vision and Forward Thinking:

  • Develop and implement strategic plans for building robust systems with strong contracts, anticipating future changes
  • Encourage the team to propose alternative requirements and solutions that better meet organizational needs
  • Set and prioritize the team's strategic book of work in line with the goals of the business

Communication and Stakeholder Engagement:

  • Communicate effectively with stakeholders, providing updates on progress and raising risks that will impact delivery
  • Ensure the team is aligned with the business vision and understands the importance of their contributions to the product

Qualifications:

  • Experience directly leading or functioning as a lead of technical teams, with a focus on SRE, DevOps, or infrastructure engineering
  • Proficiency in programming languages (Python preferred) and distributed systems (Kubernetes, Kafka, Cassandra, etc.)
  • Experience with setting up and using SLOs to track system health and performance
  • Excellent problem-solving skills and creativity in debugging complex issues
  • Deep understanding of cloud fundamentals and infrastructure management
  • Exceptional communication skills, with the ability to articulate technical problems and solutions to diverse audiences
  • A strategic mindset with a keen interest in automation and learning
  • Thorough understanding of the full stack of the system

Our system has been working properly for the past few days in our UAT environment. We deployed a new version of core infrastructure that had been tested in dev, found to be working, and then approved for UAT release. Suddenly, one of our services is not starting, and our product or QA team cannot test changes in this environment. We receive a ping or bug report that provides high-level information about what is happening, what the user would like to happen, and perhaps what they expect to happen. We ask you to take a look at the issue. Resolving this involves:

  • Asking questions of and communicating with the user to fully understand what the issue is
  • Understanding where in the stack to begin debugging
  • Constantly questioning your assumptions about the way the system should work
  • Being able to ask your peers & team the right questions to triage an issue
  • Providing updates to stakeholders that are counting on you to identify or fix the problem
  • Using your technical skill set to identify/reproduce the issue
  • Communicating what you have found to the team so that we can best resolve the issue