The SPRE (Software Production Resilience) team is seeking a Site Reliability Engineer (SRE) with a passion for maintaining highly reliable cloud-based services. In this role, you will support Red Hat’s software manufacturing services on our hybrid cloud infrastructure. You will partner with development, quality engineering, and release engineering colleagues to keep the infrastructure hosting Software Production services healthy. Your daily work will include maintaining service monitoring, improving automation, and upholding security best practices. You will participate in communities of practice to coordinate and influence the design of our hybrid cloud platform. You will be co-responsible for defining Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for the services the team needs to support stakeholders, and for executing remediation plans if the SLOs are not met.
In this role, you are expected to respond in a timely manner during critical outages and to participate in learning events that identify improvements to make our services more resilient. Join us and share our passion for helping Red Hat produce world-class open source software.
What you will do:
Design, build, and revise CI/CD systems
Work in a geographically distributed team
Configure and maintain service infrastructure
Write automation and documentation to make service maintenance faster, easier, and less error-prone
Coordinate your actions with other Red Hat teams, such as IT Platforms, Storage, and Network, to ensure our internal cloud deployment meets expectations
Provide consulting on infrastructure health, status, stability, and enhancements to other internal teams
Contribute to and advocate for enforcing best practices and change management for the infrastructure supporting software production services
Migrate software production services from legacy environments to our hybrid cloud infrastructure
Assess and champion opportunities to use Red Hat's emerging solutions in our engineering pipeline
Develop best practices around next-generation deployment patterns like service mesh and serverless; migrate advanced projects to those patterns
Implement monitoring, alerting, and escalation plans in the event of an infrastructure outage or performance problem
Work with service owners to co-define SLIs and SLOs for the services your team relies on, ensure they are met, and execute remediation plans if they are not
What you will bring:
Linux administration experience
Working knowledge of AWS technologies like S3, DynamoDB, Lambda, CloudFront, CloudFormation, IAM, KMS and Kinesis
Ability to work a hybrid schedule in Raleigh, NC; Durham, NC; Boston, MA; or Lowell, MA
Experience with container-related technologies like Kubernetes
Experience with CI/CD platforms like GitHub Actions and Jenkins
Experience with automation services like Ansible or Terraform
Ability to understand graphically represented concepts and architectures in documentation
Excellent written and verbal communication skills in English, as you'll be working in a globally distributed team
The following skills will be considered a plus:
Previous experience with the SRE model
Experience with OpenTelemetry or Prometheus
Experience with software development using Python or Go
Advanced understanding of networking and security practices
The salary range for this position is $74,900.00 - $119,830.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave