About the Job
The Sr. Site Reliability Engineer applies a deep understanding of software and systems engineering principles to design and implement solutions that enhance service reliability.
This position requires good judgment and the ability to prioritize work effectively while contributing to the overall goals of the SRE team and organization.
What You Will Do
Lead the development and implementation of robust code and automation scripts to improve service reliability and scalability
Conduct thorough code reviews and testing processes to ensure the highest quality standards in the codebase
Work to solve moderately complex issues, making decisions that impact the service's reliability and performance
Mentor and guide junior engineers, fostering a collaborative environment focused on continuous improvement
Engage in a regular on-call rotation, taking responsibility for critical incidents and ensuring timely resolution
Lead incident response and postmortem processes, implementing solutions to prevent recurrence of issues
Collaborate with cross-functional teams to design, develop, and refine SRE tools and systems that support service objectives
Take ownership of tasks and projects, prioritizing them according to their impact on service health and team goals
What You Will Bring
Linux Systems Management: Extensive experience managing Linux servers, particularly Red Hat Enterprise Linux (RHEL), CentOS, or Fedora, within cloud environments such as AWS, GCP, or Azure, including advanced system administration, networking, and troubleshooting
Automation and Scripting: Proficiency in writing and maintaining automation and orchestration code, using tools like Ansible or Terraform or custom scripts, to improve efficiency and reduce manual workload
Monitoring and Observability: Expertise in setting up and managing enterprise monitoring and observability solutions (e.g., Prometheus, Grafana), enabling proactive detection and resolution of issues
Configuration Management: In-depth experience with configuration management tools such as Puppet, Chef, or similar, ensuring consistent and reproducible system states across environments
Incident Management: Proven ability to lead incident response efforts, from initial troubleshooting to root cause analysis and implementing preventative measures
Service Delivery and Optimization: Understanding of service delivery processes, with a focus on optimizing performance, reliability, and availability of hosted services
The salary range for this position is $127,890.00 - $211,180.00. Actual offer will be based on your qualifications.
Pay Transparency
● Comprehensive medical, dental, and vision coverage
● Flexible Spending Account - healthcare and dependent care
● Health Savings Account - high deductible medical plan
● Retirement 401(k) with employer match
● Paid time off and holidays
● Paid parental leave plans for all new parents
● Leave benefits including disability, paid family medical leave, and paid military leave