Your Impact
As a Senior Site Reliability Engineer, you will play a leading role in enhancing the efficiency, scalability, and reliability of the XDR Incident Generation team. Your work will focus on implementing advanced automation, fostering a culture of operational excellence, and driving the adoption of Infrastructure as Code (IaC) and CI/CD best practices. Additionally, you'll mentor team members, design robust platforms and services, and ensure their seamless lifecycle management.
Key Responsibilities
- Manage the full lifecycle of platforms and services, from design and implementation to maintenance.
- Promote and enforce Infrastructure-as-Code (IaC) practices to enable scalable, version-controlled, and auditable infrastructure.
- Lead the automation of build, deploy, and release processes to boost team productivity and innovation.
- Design, develop, and maintain modern CI/CD pipelines aligned with industry best practices.
- Participate in the on-call rotation.
Minimum Qualifications
- Extensive experience with AWS services (such as VPC, S3, Lambda, SQS, Network Firewall, EKS, IAM, DynamoDB, or CloudWatch), along with expertise in AWS security and cost optimization.
- Proficiency in Infrastructure-as-Code tools such as Terraform and/or in scripting and programming languages such as Python and/or Bash.
- Experience building and maintaining CI/CD pipelines using tools such as GitHub Actions, ArgoCD, or TeamCity, combined with solid knowledge of incident management, postmortem analysis, and/or SLO/SLA tracking.
Preferred Qualifications
- Bachelor's degree plus 7 years of related experience, or Master's degree plus 4 years of related experience.
- Ability to collaborate across teams and communicate technical concepts with clarity and precision.
- Experience mentoring junior engineers, fostering skill development, and championing SRE best practices within the team.
- Expertise in designing AI-driven workflows for incident response, forecasting potential issues (e.g., resource exhaustion, outages), and enabling auto-scaling or remediation.
- Proficiency in integrating AI/ML tools for anomaly detection, threat response, and serverless architecture optimization on AWS.