Job responsibilities
- Write high-quality, maintainable, and well-tested software to develop reliable and repeatable solutions to complex problems.
- Collaborate with product development teams to design, implement, and manage CI/CD pipelines to support reliable, scalable, and efficient software delivery.
- Partner with product development teams to capture and define meaningful service level indicators (SLIs) and service level objectives (SLOs).
- Develop and maintain monitoring, alerting, and tracing systems that provide comprehensive visibility into system health and performance.
- Contribute to design reviews to evaluate and strengthen architectural resilience, fault tolerance, and scalability.
- Uphold incident response management best practices, and champion blameless postmortems and continuous improvement.
- Debug, track, and resolve complex technical issues to maintain system integrity and performance.
- Champion and drive the adoption of reliability and resiliency best practices.
Required qualifications, capabilities, and skills
- Formal training or certification in software engineering concepts and 3+ years of applied experience
- Proficiency in site reliability culture and principles, and familiarity with implementing site reliability within an application or platform
- Experience analyzing, troubleshooting, and supporting large-scale systems.
- Proficient knowledge of software applications and technical processes within a given technical discipline (e.g., cloud, artificial intelligence, or Android)
- Experience in observability, including white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, or Splunk
- Experience with continuous integration and continuous delivery tools such as Jenkins, GitLab, or Terraform
- Familiarity with containers and container orchestration technologies such as Docker, Kubernetes, and ECS
- Practical experience building production-grade software in at least one programming language such as Java, Python, or Go.
- Solid understanding of the fundamentals of distributed systems and of reliability patterns for achieving redundancy, fault tolerance, and graceful degradation.
- Solid understanding of networking concepts, including TCP/IP, routing, firewalls, and DNS.
- In-depth knowledge of Unix/Linux, including performance tuning, process and memory management, and file system operations.
Preferred qualifications, capabilities, and skills
- Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team
- Ability to initiate and implement ideas to solve business problems
- Practical experience with one or more of the following:
  - building, supporting, and troubleshooting JVM-based applications, including experience with tools such as JConsole or VisualVM.
  - use and support of SQL and in-memory database technologies.
  - building and maintaining CI/CD pipelines using modern tools such as GitHub Actions or GitLab CI/CD.
  - observability and monitoring tools such as Prometheus, Grafana, or OpenTelemetry.
  - containers and orchestration platforms such as Docker, Kubernetes, or Amazon ECS.
  - cloud technologies such as AWS or GCP, including deployment, management, and optimization of cloud-based applications.
  - performance and chaos testing tools such as Gremlin, Chaos Mesh, and LitmusChaos.
- Experience working in the financial/fintech industry.