Job Responsibilities
- Design, deploy, and operate application infrastructure using Amazon EKS/ECS.
- Build and maintain foundational data storage infrastructure, including Aurora PostgreSQL, OpenSearch, and Amazon S3.
- Deploy and operate open-source AI/ML software, ensuring scalability, security, and operational efficiency.
- Automate infrastructure provisioning and management using Terraform, Helm, Spinnaker, and related tools.
- Implement and uphold resiliency best practices, including defining and meeting SLAs/SLOs.
- Monitor and resolve control and hygiene alerts to maintain compliance and operational excellence.
- Lead initiatives to promote best practices in infrastructure engineering and DevOps.
- Collaborate closely with SRE and production monitoring teams to ensure system reliability, performance, and rapid incident response.
Required Qualifications, Capabilities, and Skills
Preferred Qualifications, Capabilities, and Skills
- Practical experience deploying LLM-based applications into production and an understanding of MLOps.