Implement Enhanced Testing and Recovery:
- Oversee the implementation and execution of Production Swing testing for critical applications, ensuring applications run from their alternate site for a minimum of 5 days.
- Implement and oversee Data Recovery testing, ensuring applications can recover critical data from backup solutions within the defined Impact Tolerance (ITOL).
- Drive the onboarding of critical applications to the One-Touch Recovery orchestration solution.
- Minimize the Recovery Time Actual (TRTA) for critical applications.
Design and Architecture:
- Champion resilient application design by advocating for and integrating resiliency principles into architectures, and promoting the use of established resiliency patterns.
- Leverage cloud-native services and features to enhance application resiliency. This includes services for auto-scaling, load balancing, and disaster recovery.
- Explore and implement chaos engineering practices to proactively identify and address system weaknesses under stress.
Proactive Vulnerability Management:
- Proactively identify vulnerabilities through regular architecture reviews, comprehensive scenario testing, and foundational testing.
- Document and demonstrate mitigation efforts for all discovered vulnerabilities. This includes developing remediation plans, implementing necessary changes, and validating the effectiveness of mitigations.
- Ensure that all identified vulnerabilities have remediation plans scheduled.
Operational Resilience Adherence:
- Ensure that all critical applications adhere to operational resilience testing and recovery requirements.
- Collaborate with relevant stakeholders to define and maintain appropriate impact tolerances for critical business services.
Performance Measurement and Reporting:
- Monitor and report on key resilience metrics, including the number of applications executing Production Swing tests, the number of applications on One-Touch Recovery, recovery times, and adherence to operational resilience requirements.
- Provide regular updates to senior management on the status of resilience initiatives and key performance indicators.
Key Qualifications:
- 7+ years of professional software engineering experience
- 4+ years of experience in SRE roles
- Expertise in analyzing complex application, database, network, and OS issues across distributed, large-scale, customer-facing systems
- Strong communication skills and ability to work effectively across multiple business and technical teams
- Experience in Java, .NET, Maven, Gradle, Jenkins, Helm, Puppet, Chef, Ansible, Kubernetes, AWS, Splunk, Prometheus
- BS degree in computer science or equivalent field
Applications Support
Full time
New York, New York, United States
$170,000.00 - $300,000.00
Anticipated Posting Close Date: Jul 01, 2025