Implement Enhanced Testing and Recovery:
Oversee the implementation and execution of Production Swing testing for critical applications, ensuring applications run from their alternate site for a minimum of 5 days.
Implement and oversee Data Recovery testing, ensuring applications can recover critical data from backup solutions within the defined Impact Tolerance (ITOL).
Drive the onboarding of critical applications to the One-Touch Recovery orchestration solution.
Minimize the Recovery Time Actual (TRTA) for critical applications.
Design and Architecture:
Champion resilient application design by advocating for and integrating resiliency principles into architectures, and promoting the use of established resiliency patterns.
Leverage cloud-native services and features to enhance application resiliency. This includes services for auto-scaling, load balancing, and disaster recovery.
Explore and implement chaos engineering practices to proactively identify and address system weaknesses under stress.
Proactive Vulnerability Management:
Proactively identify vulnerabilities through regular architecture reviews, comprehensive scenario testing, and foundational testing.
Document and demonstrate mitigation efforts for all discovered vulnerabilities. This includes developing remediation plans, implementing necessary changes, and validating the effectiveness of mitigations.
Ensure that all identified vulnerabilities have remediation plans scheduled.
Operational Resilience Adherence:
Ensure that all critical applications adhere to operational resilience testing and recovery requirements.
Collaborate with relevant stakeholders to define and maintain appropriate impact tolerances for critical business services.
Performance Measurement and Reporting:
Monitor and report on key resilience metrics, including the number of applications executing Production Swing tests, the number of applications on One-Touch Recovery, recovery times, and adherence to operational resilience requirements.
Provide regular updates to senior management on the status of resilience initiatives and key performance indicators.
Key Qualifications:
Relevant professional software engineering experience, particularly in SRE roles
Expertise analyzing complex application, database, network, and OS issues across distributed, large-scale, customer-facing systems
Strong communication skills and the ability to work effectively across multiple business and technical teams
Experience in Java, .NET, Maven, Gradle, Jenkins, Helm, Puppet, Chef, Ansible, Kubernetes, AWS, Splunk, Prometheus
BS degree in computer science or equivalent field
What we’ll provide you:
By joining Citi, you will not only be part of a business casual workplace with a hybrid working model (up to 2 days working at home per week), but also receive a competitive base salary (which is annually reviewed), and enjoy a whole host of additional benefits such as:
27 days annual leave (plus bank holidays)
A discretionary annual performance-related bonus
Private Medical Care & Life Insurance
Employee Assistance Program
Pension Plan
Paid Parental Leave
Special discounts for employees, family, and friends
Access to an array of learning and development resources