What you will do:
- Design, implement, and maintain observability solutions (logging, monitoring, tracing) for cloud-native applications and infrastructure.
- Develop and optimize diagnostics tooling to quickly identify and resolve system or application-level issues.
- Monitor cloud infrastructure to ensure uptime, performance, and scalability, responding promptly to incidents and outages.
- Collaborate with development, operations, and support teams to drive improvements in system observability and troubleshooting workflows.
- Lead root cause analysis for major incidents, driving long-term fixes to prevent recurrence.
- Work with customer support teams to resolve customer-facing operational issues in a timely and effective manner.
- Automate operational processes and incident response tasks to reduce manual interventions and improve efficiency.
- Continuously assess and improve cloud observability tools, integrating new features and technologies where necessary.
- Create and maintain comprehensive documentation on cloud observability frameworks, tools, and processes.
Who you will work with:
As a member of the Site Reliability Engineering (SRE) team, you will collaborate with a diverse group of professionals across various functions and regions. You will work closely with:
- Software Engineering Teams: Partner with developers to ensure that new features and services are reliable, scalable, and observable from the outset. You'll participate in design reviews and contribute to the overall architecture to enhance system performance and reliability.
- Product Management: Engage with product managers to understand customer requirements and ensure that reliability and performance are integral parts of product roadmaps.
- DevOps and Infrastructure Teams: Coordinate on automating deployment processes, managing infrastructure as code, and ensuring seamless deployment pipelines.
- Customer Support: Collaborate with customer support teams to diagnose and resolve incidents, providing insights and tools that enable faster troubleshooting and improved user experiences.
- Security and Compliance Teams: Work alongside security experts to maintain compliance with industry standards, ensuring that all systems and processes adhere to security best practices.
- Global Network Operations Teams: Interact with global operations staff spread across India, Europe, Canada, and the USA to support 24/7 service reliability and incident response.
- Data Analytics and Reporting: Team up with data analysts to create meaningful dashboards and reports that provide insights into system performance and areas for improvement.
Who you are:
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent work experience.
- 8+ years of experience in cloud engineering, site reliability engineering (SRE), or DevOps.
- Expertise with cloud platforms (AWS, Azure, GCP) and related monitoring/observability tools (e.g., Prometheus, Grafana, Datadog, ELK Stack).
- Strong experience with diagnostics and troubleshooting tools for cloud services.
- Proficient in scripting languages (Python, Bash, etc.) and infrastructure-as-code (Terraform, CloudFormation).
- Experience in operational incident management, including root cause analysis and post-mortem reviews.
- Strong understanding of containerization (Docker, Kubernetes) and microservices architecture.
- Knowledge of network performance monitoring and debugging techniques.
- Desire to solve complex problems.
- Proactive in communicating with and managing stakeholders remotely and across time zones.
But "Digital Transformation" is an empty buzz phrase without a culture that allows for innovation, creativity, and yes, even failure (if you learn from it).