Job responsibilities
- Utilizes technical expertise and problem-solving techniques to enhance large-scale system operations, while ensuring comprehensive monitoring of operating systems.
- Resolves most issues independently and determines the appropriate escalation path for the rest.
- Applies conventional approaches to break down and solve technical problems.
- Drives the daily activities supporting the standard incident resolution process.
- Partners with application and infrastructure teams to identify potential stability and capacity risks and to oversee remediation progress.
- Handles major incidents, conducts root cause analysis, and implements scheduled changes within the Linux infrastructure.
- Maintains the overall health of the system and oversees risk management and control measures in the environment.
- Adds to team culture of diversity, equity, inclusion, and respect.
Required qualifications, capabilities, and skills
- Formal training or certification in software engineering concepts and 3+ years of applied experience.
- Hands-on experience managing large-scale Linux infrastructure, ideally within a financial institution.
- Proficient in Red Hat Enterprise Linux (RHEL) administration, with experience in capacity management, business continuity planning and execution, and engineering, complemented by strong troubleshooting skills.
- Deep knowledge of one or more areas of infrastructure engineering, such as operating systems (Linux), hardware, networking, databases, storage engineering, deployment practices, integration, automation, scaling, resilience, or performance assessments.
- Deep knowledge of one specific infrastructure technology and of scripting or automation tools (e.g., Python, Ansible).
- Drive to continue developing technical and cross-functional knowledge outside of the product.
- Proficient in multiple infrastructure technologies, with the operational skills to identify and mitigate complex technical problems.
- Ability to articulate a technical strategy to more experienced management in clear, concise terms.
- Experience with capacity management, resiliency, and business continuity planning and execution.
- Working experience in financial institutions, knowledge of Linux system administration, and DevOps or Site Reliability Engineering (SRE) experience.
- Good knowledge of Splunk, Prometheus, and Grafana.
Preferred qualifications, capabilities, and skills
- Experience working on multiple service improvement programs and the ability to prioritize workloads.
- Experience with multi-tiered application architecture and a proven track record of developing and implementing IT strategy and plans.
- Familiarity with agile development methodologies and development tools such as Jira, Git, and Bitbucket.
- Experience with or certification in AWS is an added advantage.