Job Responsibilities:
- Contribute as part of a global team to design, implement, and maintain scalable, highly available, resilient systems for the Finance and Regulatory Reporting domain.
- Participate in initiatives to establish and improve SRE best practices, including monitoring, alerting, incident response, and automation.
- Collaborate with Development and Business Operations teams to enhance the reliability, performance, and scalability of our application portfolio.
- Implement DevOps practices and tooling such as CI/CD pipelines, infrastructure as code, and automated deployments.
- Monitor and improve system observability using tools such as Prometheus, Grafana, ELK stack, Dynatrace, Control-M, etc.
- Optimize system performance and ensure compliance with security and regulatory standards.
- Participate in incident management and troubleshooting efforts, ensuring minimal service disruption.
- Analyze system failures and conduct root cause analysis to prevent future incidents.
- Recognize the toil within your role and proactively work toward eliminating it through either systems engineering or updates to application code.
- Understand observability patterns and strive to implement and improve service level indicators, service level objective monitoring, and alerting solutions for optimal transparency and analysis.
- Implement and refine error budgets and SLIs, SLOs, and SLAs to improve reliability.
Required qualifications, capabilities, and skills:
- Formal training or certification in SRE principles and DevOps tools and concepts, plus 2+ years of applied experience
- Hands-on experience applying SRE principles and using DevOps tools.
- Proficiency in at least one programming language such as Python, Go, shell scripting, etc.
- Familiarity with observability practices such as white-box and black-box monitoring, service level objective alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Splunk, and others
- Familiarity with containers or a common server OS such as Linux or Windows
- Emerging knowledge of software, applications and technical processes within a given technical discipline (e.g., Cloud, artificial intelligence, etc.)
- Emerging knowledge of continuous integration and continuous delivery tools like Jenkins, GitLab, etc.
- Intermediate hands-on expertise in at least one relational database (e.g., SQL Server, Oracle, Postgres) and scheduling tools like Control-M, AutoSys, etc.
- Ability to work in a large, collaborative, globally distributed team, and willingness to voice ideas to peers and managers
- Eagerness to participate in learning opportunities to enhance one’s effectiveness in executing day-to-day project activities
Preferred qualifications, capabilities, and skills:
- Familiarity with containers/Kubernetes
- Good understanding of incident and problem management and incident triage, along with metrics such as MTTD and MTTR
- Experience with CI/CD tools such as Jenkins and deployment automation tools
- Experience with version control software (GitHub/Bitbucket)
- Experience working with public cloud infrastructure such as AWS, Azure, etc.