Job responsibilities
- Oversee the day-to-day operations of existing applications, ensuring they run smoothly and efficiently.
- Monitor system performance and troubleshoot issues to minimize downtime and maintain high availability.
- Collaborate with development teams to identify and address code-related issues.
- Collaborate with other software engineers and teams to design, develop, test, and implement availability, reliability, and scalability solutions in their applications
- Collaborate with technical experts, key stakeholders, and team members to resolve complex problems
- Understand service level indicators and use service level objectives to proactively resolve issues before they impact customers
- Support the adoption of site reliability engineering best practices within your team
- Manage the data pipelines and calculation engines that support risk assessment and reporting
- Work on projects that involve cloud integration, big data processing, and advanced data analytics to support the Market Risk group.
Required qualifications, capabilities, and skills
- Proficiency in site reliability culture and principles, and familiarity with implementing site reliability within an application or platform
- Proficiency in at least one programming language such as Python, Java/Spring Boot, or .NET
- Proficiency in software applications and technical processes within a given technical discipline (e.g., cloud, artificial intelligence, Android)
- Experience with observability practices such as white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
- Experience with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform
- Familiarity with containers and container orchestration tools such as Docker, ECS, and Kubernetes
- Familiarity with troubleshooting common networking technologies and issues
- Experience with automation tools and frameworks (e.g., Ansible, Terraform, Jenkins).
- Solid understanding of cloud platforms and services (e.g., AWS, Azure, Google Cloud).
- Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack).
- Excellent problem-solving skills and a proactive approach to identifying and addressing issues.
- Strong communication and collaboration skills, with the ability to work effectively in a team environment.
Preferred qualifications, capabilities, and skills
- Ability to contribute to large and collaborative teams by presenting information in a logical and timely manner with compelling language and limited supervision
- Ability to proactively recognize roadblocks and a demonstrated interest in learning technology that facilitates innovation
- Ability to identify new technologies and relevant solutions to ensure design constraints are met by the software team
- Ability to initiate and implement ideas to solve business problems