Job responsibilities
- Creates high-quality designs, roadmaps, and program charters that are delivered by you or the engineers under your guidance
- Provides advice and mentoring to other engineers and acts as a key resource for technologists seeking advice on technical and business-related issues
- Demonstrates site reliability principles and practices every day and champions the adoption of site reliability throughout your team
- Collaborates with others to create and implement observability and reliability designs for complex systems that are robust, stable, and do not incur additional toil or technical debt
- Uses infrastructure as code: automates with Terraform and GitLab CI/CD, containerizes our environments (Kubernetes, Helm charts), and leverages cloud technologies to meet our goals
- Expertly manages, configures, and troubleshoots operating systems, storage (block and object), and networking (VPCs, proxies, and CDNs), and administers high-availability CockroachDB, PostgreSQL, and Redis clusters
- Implements monitoring and instrumentation: metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations
- Evolves and debugs critical components of applications and platforms
- Provides comprehensive and ongoing guidance, tools, and solutions to support the firm’s growth
- Makes significant contributions to JPMorgan Chase’s site reliability community via internal forums, communities of practice, guilds, and conferences
Required qualifications, capabilities, and skills
- Formal training or certification on site reliability principles and concepts, and advanced experience implementing site reliability within an application or platform
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform
- Proven public or private cloud experience (GCP is our priority)
- Fluency in at least one programming language (e.g., Python, Java, Go)
- Extensive Kubernetes operational experience (ideally including Istio, ArgoCD)
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitHub, Terraform)
- Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker)
- Experience with troubleshooting common networking technologies and issues
- Advanced knowledge and experience in observability, including white-box and black-box monitoring, service level objectives, alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
- Advanced knowledge of software applications and technical processes with considerable depth in one or more technical disciplines
- Ability to communicate data-based solutions using complex reporting and visualization methods
Preferred qualifications, capabilities, and skills
- Recognized as an active contributor to the engineering community