Public Cloud SRE is responsible for engineering and operating the cloud infrastructure and platforms of JPMC ensuring reliability, resiliency, and security. We have a Senior Software Engineer, Site Reliability position to build the infrastructure and tooling for JPMC’s Public Cloud Platform.
Job responsibilities
- Engage in and improve the lifecycle of cloud services from inception, design, deployment, and operation
- Automate repeated manual tasks, develop tools and automation to improve the efficiency of the platform and infrastructure.
- Analyze defects, propose improvements and drive efficiencies in systems and processes.
- Helps to develop new cloud engineering strategies and implementations for the firm
- As part of Site Reliability, you have the responsibility of ensuring the reliability, availability, and performance of the cloud infrastructure and platform.
- Demonstrates site reliability principles and practices every day and champions the adoption of site reliability throughout your team
- Develop observability and telemetry tools.
- Author and improve the quality of technical engineering documentation
- Debug and solve issues in a production environment
- Participates in SRE on-call rotations and escalation workflows.
Required qualifications, capabilities, and skills
- Formal training or certification on software engineering or site reliability engineering and 5+ years applied experience
- Bachelor’s Degree in Computer Science or equivalent
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform
- Expertise in building solutions with AWS cloud service, knowledge in Infrastructure as Code, tools such as Terraform and fluency in at least one programming language such as Python and Java
- Proficiency and experience in observability such as white and black box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc.
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform, etc.)
- Experience with container and container orchestration (e.g., ECS, Kubernetes, Docker, etc.) and troubleshooting common networking technologies and issues
- Ability to identify and solve problems related to complex data structures and algorithms
- Drive to self-educate and evaluate new technology and ability to teach team members
- Ability to expand and collaborate across different levels and stakeholder groups. Excellent communication skills working with stakeholders and domain experts across the company to design solutions to user problems
- Self-disciplined, self-managed, self-motivated and strong sense of ownership, urgency and drive
Preferred qualifications, capabilities, and skills
- AWS certifications will be a bonus.