Job responsibilities
- Collaborates with other software engineers and teams to design, develop, test, and implement availability, reliability, and scalability solutions in their applications
- Implements infrastructure, configuration, and network as code for the applications and platforms in their remit
- Understands service level indicators and utilizes service level objectives to proactively resolve issues before they impact customers
- Design and implement solutions to enhance the reliability and scalability of AI/ML systems
- Partner with product engineering teams to ensure the AI/ML systems are reliable and high performing.
- Develop observability, security, automation and fin-ops tools and orchestration.
- Build strong cross-functional relationships that foster engagements across the organization and deliver solutions to user problems.
- Debug and solve issues in a production environment, identify root cause and remediate.
- Participates in on-call rotations, incident management and escalation workflows.
- Take full ownership of problems, develop solutions, and acquire new knowledge to complete the task.
- Mentor and guide junior engineers.
Required qualifications, capabilities, and skills
- Formal training or certification on Site Reliability Engineering concepts and applied experience
- Expertise in SRE principles, reliability, scalability and performance of applications and infrastructure.
- Expertise in programming with Python and Infrastructure as Code tools such as Terraform.
- Experience working with distributed systems and cloud-native architecture in AWS.
- Systematic problem-solving and troubleshooting skills in a complex system.
- Excellent communication skills and ability to represent and present business and technical concepts to stakeholders.
- Self-managed, self-motivated with strong sense of ownership, urgency, and drive.
Preferred qualifications, capabilities, and skills
- Prior experience working in AI, ML, or Data engineering.
- Expertise in container orchestration/Kubernetes.
- Prior experience developing Automation frameworks/AI Ops.
- Prior experience building observability and telemetry tools.