Job responsibilities
- Drive, support, and deliver on a strategy to build broad use of Amazon's utility computing web services (e.g., AWS EC2, AWS S3, AWS RDS, AWS CloudFront, AWS EFS, CloudWatch, EKS)
- Identify opportunities to improve the resiliency, availability, security, and performance of platforms in Public Cloud using JPMC best practices
- Improve reliability and quality, and reduce the time to resolve production incidents on software applications
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
- Provide primary operational support and engineering for the public cloud platform
- Debug and optimize systems and automate routine tasks.
- Collaborate with a cross-functional team to identify potential risks in production and opportunities to improve user experiences at every interaction.
- Drive work streams to ensure applications meet strict non-functional requirements for Public Cloud onboarding
- Evaluate production readiness through game days, resiliency tests and chaos engineering exercises.
- Utilize programming languages like Java, Python, SQL, Node, Go, and Scala; open source RDBMS and NoSQL databases; container and orchestration services including Docker and Kubernetes; and a variety of AWS tools and services
- Monitor metrics and program health, anticipate and clear blockers, and manage escalations
Required qualifications, capabilities, and skills
- Formal training or certification in infrastructure engineering concepts and 3+ years of applied experience
- 6+ years of experience across the SDLC process: design, development, and/or support
- 3-5 years of experience building or supporting web environments on AWS, including work with services like EC2, ELB, RDS, and S3
- Experience using DevOps tools in a cloud environment, such as Ansible, Artifactory, Docker, GitHub, Jenkins, Kubernetes, Maven, and SonarQube
- Experience with monitoring solutions like CloudWatch, Prometheus, and Datadog, as well as writing Infrastructure-as-Code (IaC) using tools like CloudFormation or Terraform
- Experience with one or more public cloud platforms such as AWS, GCP, or Azure, as well as one or more automation tools such as Terraform, Puppet, or Ansible
- Ability to leverage Splunk and Dynatrace to identify and troubleshoot issues.
- Experience with ITIL processes such as incident, problem, and life cycle management
- Experience with high-volume, mission-critical applications and with building upon messaging and/or event-driven architectures.
- Knowledge of container platforms such as Docker and Kubernetes.
- Keen understanding of financial and budget management, including control and optimization of Public Cloud expenses
Preferred qualifications, capabilities, and skills
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
- SRE mindset and culture: running better production systems by creating engineering solutions to operational problems.
- Ability to program (structured and OO) with one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript
- Experience with Ansible and other DevOps tools is an added advantage.
- AWS Certification.