5+ years of related experience on a software development team; reliability or engineering excellence experience preferred
Expert in one of the following: Automation, Monitoring tools, Cloud Operations
Solid AWS experience
Comfortable with backend or full-stack coding and scripting: strong experience with Java/J2EE, Go, Python, REST, SOAP, and JSON
Skilled in software development lifecycle processes, with experience in Scrum and Agile development
Knowledge of current trends and best practices in the modern SaaS technology landscape
Experience in leveraging Amazon Web Services for building scalable applications
High adaptability and flexibility
Work well under pressure
Easy to work with: you communicate well, collaborate, and rally the people around you during crises
“Do whatever it takes” attitude
Have a passion for working on systems that are highly reliable, maintainable, scalable, and secure
High-energy self-starter with a positive mindset and a "can do" attitude
Years of experience in fintech with large-scale payment systems
Operational Excellence:
Proactively identifies and resolves product stability issues, thereby improving quality and availability
Expertise in designing and implementing advanced CI/CD and automation/resiliency concepts such as Progressive Rollouts and Failure Modes and Effects Analysis (FMEA)
Identifies and drives resiliency, cost optimization, and process improvements
Manages and performs on-call duties to ensure operational excellence and quick resolution of production incidents
Software Fundamentals:
Writes and reviews code to eliminate complexity while ensuring security, scalability, performance, testability, resiliency, and maintainability
Expert at diagnosing and resolving cross-capability issues, with a focus on tooling and observability
Enhances test coverage including unit tests, end-to-end tests, and integration tests to maintain production system robustness
Experience with metrics, monitoring, and alerting tools such as Splunk, Wavefront, AppDynamics, Prometheus, and PagerDuty
Design and Architecture:
Promotes standard practices for tooling, monitoring, and observability
Develops tools that focus on improving system observability, including metrics, logging, and tracing
Communications:
Ability to persuade others of the merits of a design, especially for tooling and observability solutions that ensure system reliability and performance
Is receptive to feedback from peers and acts on it, particularly in high-pressure incident resolution scenarios
Collaborates with other team members to solve problems more effectively, emphasizing cross-functional collaboration during production incidents
Demonstrated ability to explain complex technical issues to both technical and non-technical audiences