As a Site Reliability Engineer (SRE), you will help support a Data Platform that is part of the core data strategy across the firm. This role offers an opportunity to work on cutting-edge platforms and interact with the business processes of a leading global bank. The role expects end-to-end involvement in supporting and understanding the components of the big data platform, building automation tools to reduce manual toil, and adopting core SRE principles across technologies such as Big Data, AWS, and CI/CD.
Job responsibilities
- Monitor and troubleshoot data jobs across multiple data platforms.
- Troubleshoot issues that arise from data loads.
- Interact with upstream and downstream data systems.
- Engage with the development team throughout the life cycle to help troubleshoot issues in production and non-production environments.
- Troubleshoot priority incidents, facilitate blameless post-mortems, and ensure permanent closure of incidents.
- Proactively identify hidden problems and patterns in data, and use these insights to fix production issues.
- Work on builds and pipelines to deploy applications.
- Build small automation tools.
- Effectively communicate issues across multiple teams.
Required qualifications, capabilities, and skills
- Formal training or certification on software engineering concepts and 3+ years applied experience
- Minimum 5 years of relevant experience in development
- Demonstrated ability to work independently, with strong ownership, collaboration, and communication skills.
- Excellent debugging and troubleshooting skills.
- Experience with continuous integration/delivery tools such as Git, Subversion, Jenkins, and Sonar
- Experience with deployments and CI/CD tasks
- Hands-on knowledge of Big Data platforms
- Ability to work on data analysis using SQL.
- Hands-on AWS experience and/or cloud certifications
- Experience deploying and supporting microservices and cloud-based applications.
- Strong UNIX scripting skills
- Ability to design, code, test, and deliver software that automates manual operational work (toil).
Preferred qualifications, capabilities, and skills
- Intermediate proficiency in at least one technology stack or programming language (Java, Python, or Scala) for coding, testing, and delivering software
- Experience maintaining a cloud-based infrastructure.
- Familiarity with site reliability concepts, principles, and practices