Expoint - all jobs in one place

Red Hat Senior Site Reliability Engineer
Czechia, Southeast, Brno 
514207660

04.08.2024

The SPRE (Software Production Resilience) team is seeking a Site Reliability Engineer (SRE). The team is looking for a self-motivated person who has a passion for maintaining highly reliable cloud-based services. In this role, you will support Red Hat’s software manufacturing services on our hybrid cloud infrastructure. You will partner with development, quality engineering and release engineering colleagues to support the health and well-being of the infrastructure hosting software production services. Maintaining service monitoring, improving automation, and upholding security best practices will be your daily work. You will participate in communities of practice to coordinate and influence the design of our hybrid cloud platform. You will be responsible for defining Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for the services the team needs to support stakeholders, and executing remediation plans if the SLOs are not met.

Job Responsibilities

  • Design, build, and revise CI/CD systems

  • Work in a geographically distributed team

  • Configure and maintain service infrastructure

  • Write automation and documentation to make service maintenance faster, easier, and less error-prone

  • Communicate the needs of your team to other Red Hat teams, such as IT Platforms and IT Storage, and ensure our internal cloud deployment meets expectations

  • Provide consulting on infrastructure health, status, stability, and enhancements to other internal teams

  • Contribute to, and advocate for, the enforcement of best practices and change management for the infrastructure supporting software production services

  • Migrate software production services in legacy environments to our hybrid cloud infrastructure

  • Assess and champion opportunities to use Red Hat's emerging solutions in our engineering pipeline

  • Develop best practices around next-generation deployment patterns like service mesh and serverless; migrate advanced projects to those patterns

  • Implement monitoring, alerting, and escalation plans in the event of an infrastructure outage or performance problem

  • Work with service owners to define SLIs and SLOs for the services your team relies on, ensure they are met, and execute remediation plans if they are not

  • Manage projects, gather requirements, and translate them into work items

Required skills

  • Linux administration experience

  • Working knowledge of AWS technologies like S3, DynamoDB, Lambda, CloudFront, CloudFormation, IAM, KMS and Kinesis

  • Experience with container-related technologies like Docker or Kubernetes

  • Experience with CI/CD platforms like GitHub Actions and Jenkins

  • Experience with Ansible

  • Ability to graphically represent concepts and architectures in documentation

  • Excellent written and verbal communication skills in English, as you'll be working in a globally distributed team

The following skills will be considered a plus:

  • Experience with Terraform

  • Experience with OpenTelemetry or Prometheus

  • Experience with software development using Python

  • Advanced understanding of networking and security practices