The Role:
As part of the CDH engineering team, you will be responsible for identifying bottlenecks (e.g., long-running data extractions), implementing targeted fixes, and proactively maintaining the reliability and performance of our cloud-native solution. You’ll also collaborate closely with developers and customer-facing colleagues to ensure seamless support and continuous improvement of our platform.
What You'll Do:
- Act as DevOps point of contact for customer-facing issues, ensuring fast triage and resolution
- Analyze performance issues in complex distributed systems, especially related to data extractions and processing pipelines
- Implement small fixes or configuration changes to stabilize or improve performance
- Improve observability by enhancing dashboards, alerts, and logs for early anomaly detection
- Maintain CI/CD workflows, automation scripts, and cloud infrastructure components
- Drive continuous improvement through post-incident reviews, documentation, and process refinement
- Support the team with operational excellence and knowledge-sharing across locations
What You'll Bring:
- Experience in DevOps or Site Reliability Engineering
- Strong troubleshooting skills in distributed systems, preferably in a data-centric or backend-heavy environment
- Solid development skills (e.g., Python, Node.js, JavaScript)
- Familiarity with Node.js and JavaScript-based backend systems – ability to understand logs, analyze issues, and contribute small fixes when needed
- Experience with observability tools such as Grafana, Prometheus, or similar
- Hands-on experience with cloud platforms
- Strong sense of ownership, accountability, and a customer-first mindset
- Excellent communication and collaboration skills – comfortable working in cross-functional teams
We win with inclusion
Successful candidates might be required to undergo a background verification with an external vendor.