Job responsibilities
- Work with High Security Access (HSA) systems, which require enhanced screening for employment or assignment.
- Develop and implement proactive monitoring systems that enable automatic healing and recovery during system failures.
- Collect, analyze, and synthesize data to create visualizations and reports on application/system health, uptime, and performance improvements.
- Manage change and capacity for supported services.
- Foster strong relationships with development teams to build reliability throughout the lifecycle.
- Analyze incident/problem patterns, conduct post-mortems, and develop permanent remediation plans with automation to prevent future incidents.
- Promote a team culture of diversity, equity, inclusion, and respect.
Required qualifications, capabilities, and skills
- Formal training or certification in site reliability engineering concepts and 3+ years of applied experience.
- Proficiency across Linux, UNIX, and Windows platforms, as well as in application and middleware support.
- Expertise in observability, instrumentation, monitoring, alerting, and responding to performance and availability issues using tools like AppDynamics, Dynatrace, Splunk, and Grafana.
- Experience with automation and configuration tools, including Ansible, Puppet, and Chef.
- Hands-on experience scripting in at least one programming language, such as Python or Java.
- Experience managing critical application outages in large-scale operations and driving root cause analysis and remediation, plus familiarity with Jenkins, Git, CI/CD pipelines, and Agile/Scrum practices.
Preferred qualifications, capabilities, and skills
- Understanding of DevOps and SRE principles.
- Familiarity with infrastructure management, capacity planning, and resilience strategies.