Your Impact
As a Senior Site Reliability Engineer, you will play a leading role in improving the efficiency, scalability, and reliability of the XDR Incident Generation team. Your work will focus on implementing advanced automation, fostering a culture of operational excellence, and driving the adoption of Infrastructure as Code (IaC) and CI/CD best practices. You will also mentor team members, design robust platforms and services, and manage them reliably across their full lifecycle.
Key Responsibilities
- Manage the full lifecycle of platform services, from design and implementation to maintenance.
- Promote and enforce Infrastructure-as-Code (IaC) practices to enable scalable, version-controlled, and auditable infrastructure.
- Lead the automation of build, deploy, and release processes to boost team efficiency and innovation.
- Design, develop, and maintain modern CI/CD pipelines aligned with industry best practices.
Minimum Qualifications
- Extensive experience with AWS services (including VPC, S3, Lambda, SQS, Network Firewall, ECS/EKS, IAM, DynamoDB, and CloudWatch), along with expertise in AWS security and/or cost optimization.
- Proficiency in Infrastructure-as-Code tools such as Terraform, and scripting/programming languages including Python and/or Bash.
- Experience building and maintaining CI/CD pipelines using tools like GitHub Actions or TeamCity, combined with strong knowledge of incident management, postmortem analysis, and/or monitoring SLOs/SLAs.
- Ability to participate in an on-call rotation.
Preferred Qualifications
- Bachelor's degree with 7 years of related experience, or Master's degree with 4 years of related experience.
- Ability to collaborate across teams and communicate technical concepts with transparency and precision.
- Experience mentoring junior engineers, fostering skill development, and upholding SRE best practices within a team.
- Expertise in designing AI-driven workflows for incident response, forecasting potential issues (e.g., resource exhaustion, outages), and enabling auto-scaling or remediation.
- Proficiency in integrating AI/ML tools for anomaly detection, threat response, and serverless architecture optimization on AWS.