What you’ll do:
- Metrics: Implement comprehensive service metrics to track and report on system reliability, performance, and efficiency
- Optimization: Monitor system performance, identify bottlenecks, and execute pipeline optimization
- Collaboration: Work with Scrum teams and other stakeholders to identify potential risks
- Analysis: Conduct post-incident reviews to prevent recurrence and refine the system reliability framework
What you’ll bring:
- A bachelor's or master's degree in computer science, information systems, or a related technical field
- 4 to 7 years of experience as a Site Reliability Engineer
- Proficiency in programming languages such as Python, Go, or Java
- In-depth understanding of operating systems, networking, and cloud services
- Experience with monitoring tools (for example, Datadog, ELK, Redash)
- Proven experience managing large-scale distributed systems, with a solid understanding of the principles of scalability and reliability
- Familiarity with DevOps culture and practices, and experience with CI/CD systems
- Excellent diagnostic and problem-solving skills, with the ability to analyze complex systems and data
- Certifications in cloud services, networking, or systems administration are an advantage
Our people are the foundation of our success, and we prioritize offering a wide range of benefits that make our team happier and healthier.
- Equity participation - everyone shares in our success
- Hybrid work
- Opportunities for professional growth
- Team fun & company outings
- Statutory benefits and leave benefits
- Health insurance coverage
Our Values:
We look for people who embody our values - Care, Do, Try & Shine.
- Care - We care about our customers and each other
- Do - We do what it takes to make a positive impact
- Try - We try our best and we don’t give up
- Shine - We shine