GE HealthCare Senior Site Reliability Engineer 
United States, Illinois 
707406616

30.03.2025

Roles and Responsibilities

  • Establish performance baselines and capacity thresholds, correlate events, and define monitoring and alerting criteria.
  • Develop automated solutions to address potential problems before they result in a service interruption.
  • Provide impact assessment and mitigation plan for changes going into the production environment.
  • Investigate root cause of severe and systemic outages, identify corrective actions and apply across the enterprise.
  • Develop availability measures that align with consumer experience to accurately assess the usability of crucial services.
  • Build capacity models that baseline transactional load against resource performance, use the data to predict overall system capacity, and automate load placement to avoid outages.
  • Identify thresholds for all critical links in the data path to quickly isolate where imbalances may result in potential outages.
  • Analyse failure points in services to model risk level and resolution steps if failure occurs.
  • Assist in driving architecture enhancements into the system to mitigate potential failure points.
  • Programmatically monitor for and remediate configuration drift of critical devices.
  • Develop response plans to potential failure points and evaluate effectiveness during planned tests.
  • Perform comprehensive operational health checks of all services to identify areas of concern, and track activities to drive improvements at all levels of the architecture.
  • Provide technical coaching and direction to more junior teammates.

Qualifications/Essential Requirements

  • Bachelor's degree in Computer Science or a STEM major (Science, Technology, Engineering and Math) with at least 10 years of progressive experience.
  • Experience in site reliability engineering, with a focus on AWS.
  • Strong understanding of AWS Services, architecture, and best practices.
  • Experience with configuring, customizing, and extending monitoring/APM tools (Datadog, Kloudfuse, Grafana, Splunk, etc.).
  • Operational experience in complex distributed systems, including defining, measuring, and monitoring SLOs/SLAs for availability and reliability goals.
  • Experience with incident management and post-incident reviews.
  • AWS Certified Solutions Architect Associate or AWS Certified DevOps Engineer certification is a plus.

Preferred Qualifications

  • Expertise in the management and administration of Kubernetes clusters.
  • Strong background in scripting, automation, configuration management, and infrastructure-as-code practices (Terraform, AWS CloudFormation, Crossplane, Pulumi, etc.).
  • Good understanding of DevOps practices, CI/CD pipelines, and version control systems (Git). Experience in GitOps is a plus.
  • Strong knowledge of Unix-based operating systems, workload management, and networking systems.