Job responsibilities
- Applies technical knowledge and problem-solving methodologies to projects of moderate scope, with a focus on improving the data and systems running at scale, and ensures end-to-end monitoring of applications
- Resolves most nuances and determines the appropriate escalation path
- Executes conventional approaches to build or break down technical problems
- Drives the daily activities supporting the standard capacity process applications
- Partners with application and infrastructure teams to identify potential capacity risks and govern remediation statuses
- Considers upstream/downstream data and systems or technical implications
- Is accountable for making significant decisions for a project consisting of multiple technologies and applications
- Adds to a team culture of diversity, equity, inclusion, and respect
Required qualifications, capabilities, and skills
- Formal training or certification on infrastructure engineering concepts and 3+ years applied experience
- Strong knowledge of one or more infrastructure disciplines such as hardware, networking terminology, databases, storage engineering, deployment practices, integration, automation, scaling, resilience, and performance assessments
- Advanced proficiency in one or more scripting languages (e.g., Python, JavaScript, Shell)
- Solid experience with monitoring and with analysis tools for Security Information and Event Management (SIEM)
- Hands-on expertise in cloud services with an emphasis on Infrastructure as Code (IaC) utilizing tools such as Terraform
- Experience with multiple cloud technologies (AWS/GCP/Azure) with the ability to operate in and migrate across public and private clouds
- Advanced understanding of agile methodologies and related practices such as CI/CD, application resiliency, and security
- Drive to develop infrastructure engineering knowledge of additional domains, data fluency, and automation knowledge
Preferred qualifications, capabilities, and skills
- SIEM software experience
- Splunk Enterprise
- Prometheus
- AWS certifications (e.g., Practitioner, Solutions Architect, Security, Networking, Developer)
- Grafana dashboard experience
- Site Reliability Engineering (SRE) practices