Expoint - all jobs in one place

Finding a high-tech job at the best companies has never been easier

JPMorgan Lead Site Reliability Engineer 
United States, Texas, Plano 
370462376

Yesterday

Take on a critical role in shaping the future of a globally recognized firm, with a direct and significant impact in a space built for top achievers in site reliability.

Job responsibilities

  • Manage incident response to swiftly mitigate business impacts by coordinating cross-functional teams.
  • Serve as the primary point of contact during major incidents, demonstrating the ability to quickly identify and resolve issues to prevent financial losses.
  • Participate in 24x7 support coverage as required.
  • Oversee, track, and validate all changes to the Production and Disaster Recovery environments.
  • Lead initiatives to enhance the reliability and stability of team applications and platforms, utilizing data-driven analytics to improve service levels.
  • Document and share knowledge within the organization through internal forums and communities of practice.
  • Collaborate with team members to identify comprehensive service level indicators and work with stakeholders to establish reasonable service level objectives and error budgets with customers.
  • Provide ongoing guidance, tools, and solutions to support the firm's growth.
  • Champion and demonstrate site reliability culture and practices, exerting technical influence throughout the team.
  • Exhibit a high level of technical expertise in one or more domains, proactively identifying and resolving technology-related bottlenecks.
  • Strive to become an expert on the applications and platforms under your purview, understanding their interdependencies and limitations.

Required qualifications, capabilities, and skills

  • Formal training or certification on software engineering concepts and 5+ years of applied experience
  • Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform
  • Fluency in at least one programming language (e.g., Python, Java Spring Boot, .NET)
  • Deep knowledge of software applications and technical processes with emerging depth in one or more technical disciplines
  • Proficiency and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
  • Proficiency and experience in Cloud Platform (AWS) infrastructure and in setting up monitoring and observability for applications migrated to cloud platforms
  • Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform)
  • Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker)
  • Experience with troubleshooting common networking technologies and issues
  • Ability to identify and solve problems related to complex data structures, algorithms, and new technologies, and to self-educate on new technology when needed
  • Ability to reach out and collaborate across different levels and stakeholder groups

Preferred qualifications, capabilities, and skills

  • Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team
  • Ability to initiate and implement ideas to solve business problems
  • Experience building dashboards with products such as Grafana
  • Prior experience in both Systems Engineering and Software Development