Job responsibilities
- Apply technical expertise to one or more High Security Access (HSA) systems.
- Provide operational and development expertise to establish proactive monitoring and contribute to automatic healing and recovery during system failures.
- Collect, analyze, and synthesize data to create visualizations and reports on application/system health, uptime, performance enhancements, and change/capacity management for supported services.
- Develop strong relationships and collaborate with the development team throughout the project lifecycle to enhance reliability. Identify and analyze incident/problem patterns, conduct thorough post-mortems, create permanent remediation plans, and implement automation to prevent future incidents.
- Proactively identify hidden issues and patterns, and manage service disruptions effectively, to maximize delivery speed.
- Contribute to a team culture that values diversity, equity, inclusion, and respect.
Required qualifications, capabilities, and skills
- Formal training or certification in infrastructure engineering concepts and 5+ years of applied experience.
- Proficient in platform skills across Linux, UNIX, and Windows, with strong knowledge of application and middleware support.
- Hands-on experience with automation and configuration management tools such as Ansible, Puppet, and Chef.
- Skilled in scripting and programming languages, including Python and Java. Ability to lead and provide technical guidance to a team.
- Experienced in managing critical application outages in large-scale operations, conducting root cause analysis, and implementing remediation strategies.
- Experience with instrumentation, monitoring, alerting, and response processes related to application performance and availability, using tools like AppDynamics, Dynatrace, Grafana, and Splunk.
- Knowledge of Jenkins, Git, CI/CD pipelines, and Agile and Scrum methodologies.
Preferred qualifications, capabilities, and skills
- Familiar with concepts and principles behind DevOps and SRE.
- Familiar with cloud engineering, including an understanding of private cloud principles and exposure to public cloud offerings such as AWS, Azure, Google Cloud, or similar technology.