The SPRE (Software Production Resilience) team is seeking a Site Reliability Engineer (SRE). The team is looking for a self-motivated person with a passion for maintaining highly reliable cloud-based services. In this role, you will support Red Hat’s software manufacturing services on our hybrid cloud infrastructure. You will partner with development, quality engineering, and release engineering colleagues to keep the infrastructure that hosts software production services healthy. Your daily work will include maintaining service monitoring, improving automation, and upholding security best practices. You will participate in communities of practice to coordinate and influence the design of our hybrid cloud platform. You will also be responsible for defining Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for the services the team relies on to support its stakeholders, and for executing remediation plans when the SLOs are not met.
Job Responsibilities
Design, build, and revise CI/CD systems
Work in a geographically distributed team
Configure and maintain service infrastructure
Write automation and documentation to make service maintenance faster, easier, and less error-prone
Communicate your team's needs to other Red Hat teams, such as IT Platforms and IT Storage, and ensure our internal cloud deployment meets expectations
Provide consulting on infrastructure health, status, stability, and enhancements to other internal teams
Contribute to, and advocate for enforcing, best practices and change management for the infrastructure supporting software production services
Migrate software production services in legacy environments to our hybrid cloud infrastructure
Assess and champion opportunities to use Red Hat's emerging solutions in our engineering pipeline
Develop best practices around next-generation deployment patterns like service mesh and serverless; migrate advanced projects to those patterns
Implement monitoring, alerting, and escalation plans in the event of an infrastructure outage or performance problem
Work with service owners to define SLIs and SLOs for the services your team relies on, ensure they are met, and execute remediation plans if they are not
Manage projects, gather requirements, and translate them into work items
Required skills
Linux administration experience
Working knowledge of AWS technologies like S3, DynamoDB, Lambda, CloudFront, CloudFormation, IAM, KMS and Kinesis
Experience with container-related technologies like Docker or Kubernetes
Experience with CI/CD platforms like GitHub Actions and Jenkins
Experience with Ansible
Ability to graphically represent concepts and architectures in documentation
Excellent written and verbal communication skills in English, as you'll be working in a globally distributed team
The following skills will be considered a plus:
Experience with Terraform
Experience with OpenTelemetry or Prometheus
Experience with software development using Python
Advanced understanding of networking and security practices