Expert in one of the following: Automation, Monitoring tools, Cloud Operations
Solid AWS experience
Solid design and coding skills: 8+ years of strong software engineering experience with Java/J2EE, Go, Python, REST, SOAP, JSON
Deep understanding of deployment architecture; consistently designs for low MTTD/MTTR, fast recovery, and self-healing
Skilled in software development lifecycle processes. Experience with Scrum and Agile development
Knowledge of current trends and best practices in the modern SaaS technology landscape
Experience in leveraging Amazon Web Services for building scalable applications
Strong mentoring skills. Able to influence and communicate effectively with both technical and non-technical people
High adaptability and flexibility
Excellent communication skills. Communicates clearly, succinctly, and persuasively with employees at all levels, customers, and management (including executives)
Experience in making successful trade-offs that balance the short- and long-term product goals
High-energy self-starter with a positive mindset and a "can-do" attitude
Comfortable working in complex production environments and seeks out ways to reduce ambiguity. Seeks to understand before making changes, and actively facilitates communication to better understand other approaches to a problem
Experience with large-scale payment systems
Technical Leadership:
Acts as a mentor to junior and mid-level engineers by providing guidance on best practices, architecture decisions, and career development, fostering a culture of continuous learning and improvement within the team
Leads the strategic direction for reliability engineering, ensuring alignment with business goals and technological advancements. Establishes a clear vision for the team, setting measurable objectives and tracking progress to ensure the delivery of high-quality, stable, and scalable systems
Coordinates with product managers, developers, operations teams, and other stakeholders to align on priorities, share insights, and ensure cohesive delivery of solutions that meet both business and technical needs. Facilitates communication across teams to address dependencies and ensure seamless integration of reliability practices into everyday workflows
Operational Excellence:
Proactively identifies and resolves product stability issues, thereby improving quality and availability
Expertise in designing and implementing advanced CI/CD, automation, and resiliency concepts such as progressive rollouts and Failure Modes and Effects Analysis (FMEA)
Identifies and drives resiliency, cost optimization, and process improvements
Manages and performs on-call duties to ensure operational excellence and quick resolution of production incidents
Software Fundamentals:
Writes and reviews code to eliminate complexity while ensuring security, scalability, performance, testability, resiliency, and maintainability
Expert at diagnosing cross-capability issues, with a focus on tooling and observability
Enhances test coverage including unit tests, end-to-end tests, and integration tests to maintain production system robustness
Design and Architecture:
Creates and promotes standard practices for tooling, monitoring, and observability
Develops tools that focus on improving system observability, including metrics, logging, and tracing
Communications:
Able to persuade others of the merits of a design, especially for tooling and observability solutions that ensure system reliability and performance
Is receptive to feedback from peers and acts accordingly, particularly in high-pressure incident resolution scenarios
Initiates and facilitates active technical and best practices discussions across multiple scrum teams
Collaborates with other team members to solve problems more effectively, emphasizing cross-functional collaboration during production incidents
Demonstrated ability to explain complex technical issues to both technical and non-technical audiences, focusing on aligning stakeholders on observability and SRE best practices