Job responsibilities:
- Collaborate with product and engineering teams to deliver robust cloud-based solutions that drive enhanced customer experiences.
- Own platform issues and problem management end to end, help provide solutions to platform production issues on the AWS Cloud, and ensure the applications are available as expected.
- Guide various product teams on the standards and best practices related to the Public Cloud process and help them mitigate issues in production cloud with minimal downtime.
- Lead a team to develop, enhance, and maintain established standards and best practices; drive self-service; and deliver on a strategy to build and operate on broad use of Amazon's utility computing web services (e.g., AWS EC2, AWS S3, AWS RDS, AWS CloudFront, AWS EFS, CloudWatch, EKS).
- Analyze upcoming platform-level changes going into production and ensure the relevant impact is communicated.
- Identify opportunities to improve the resiliency, availability, security, and performance of platforms in the Public Cloud using JPMC best practices. Improve reliability and quality, and reduce the time to resolve production incidents in software applications.
- Implement continuous process improvement, including but not limited to policies, procedures, and production monitoring, to reduce time to resolve. Identify, coordinate, and implement initiatives, projects, and activities that create efficiencies and optimize technical processing.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve.
- Provide primary operational support and engineering for the public cloud platform. Show leadership on any production issue, manage all the teams involved in working toward a fix, and ensure minimal customer impact.
- Debug and optimize systems and automate routine tasks. Collaborate with a cross-functional team to identify potential risks in production and opportunities to improve user experiences at every interaction. Drive work streams to ensure applications meet strict operational readiness requirements for Public Cloud onboarding. Evaluate production readiness through game days, resiliency tests, and chaos engineering exercises.
- Utilize programming languages like Java, Python, SQL, Node, Go, and Scala; open-source RDBMS and NoSQL databases; container and orchestration services including Docker and Kubernetes; and a variety of AWS tools and services.
Required qualifications, capabilities, and skills
- Formal training or certification in software engineering concepts and 10+ years of applied experience. In addition, 5+ years of experience building or supporting environments on AWS using Terraform, including work with services like EC2, ELB, RDS, and S3.
- Strong understanding of business technology drivers and their impact on architecture design, performance, monitoring, and best practices. Dynamic individual with excellent communication skills who can adapt verbiage and style to the audience at hand and deliver critical information in a clear and concise message.
- Strong analytical thinker, with business acumen and the ability to assimilate information quickly, with a solution-based focus on incident and problem management.
- Expertise using DevOps tools in a cloud environment, such as Ansible, Artifactory, Docker, GitHub, and Jenkins. Expertise using monitoring solutions like CloudWatch, Prometheus, and Datadog. Experience with or knowledge of writing Infrastructure-as-Code (IaC) using tools like CloudFormation or Terraform.
- Experience with one or more public cloud platforms like AWS, GCP, or Azure. Experience with one or more automation tools like Terraform, Puppet, or Ansible.
- Experience with high volume, mission critical applications and their interdependencies with other applications and databases
- Ability to leverage Splunk and Dynatrace to identify and troubleshoot issues. Experience with ITIL processes such as incident, problem, and life cycle management. Experience with high-volume, mission-critical applications, and with building upon messaging and/or event-driven architectures.
- Knowledge of container platforms such as Docker and Kubernetes. Strong understanding of architecture, design, and business processes. Keen understanding of financial and budget management, including control and optimization of Public Cloud expenses.
- Experience working in large, collaborative teams to achieve organizational goals. Passionate about building an innovative culture.
- Experience with production/non-production support of highly available applications. Experience with system performance monitoring and operational capacity management
- Strong communication and collaboration skills
Preferred qualifications, capabilities and skills
- Bachelor’s degree in computer science or another technical or scientific discipline
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
- AWS Certification.
- SRE mindset, culture, and approach: running better production systems by creating engineering solutions to operational problems.
- Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C/C++, Ruby, and JavaScript.
- Experience with Ansible and other DevOps tools is an added advantage.