The role:
An operations-first engineer with deep expertise in running, scaling, automating, and monitoring streaming infrastructure like Kafka and RabbitMQ.
What you will do…
- Automate Deployment and Operation: Oversee deployment of Kafka and RabbitMQ clusters (including Confluent Cloud & CFK). Build automation pipelines to ensure repeatability and resiliency across environments.
- Monitor and Support Production Systems: Own production stability of global Kafka clusters. Handle on-call rotations, incident management, troubleshooting, and scaling challenges.
- Improve Infrastructure Observability: Build and maintain observability systems: dashboards, alerting pipelines, metrics collection (Prometheus, Grafana, etc.).
- Optimize System Performance: Collaborate with peers on benchmarking and optimization initiatives. Work on tuning Kafka brokers, cluster configurations, and runtime parameters.
- Provide Developer Support and Training (Infra-focused): Help developers configure topics, quotas, and consumers appropriately. Train service owners to interpret monitoring data and avoid pitfalls.
- Develop and Maintain Infrastructure: Contribute to building infrastructure tools and scripts (IaC, Helm charts, etc.) that make provisioning and managing clusters reliable and efficient.
- Secure Infrastructure Access: Configure and maintain secure access patterns across streaming infrastructure, ensuring proper authentication and role-based access controls are enforced for both developers and services.
What we expect…
- 8+ years of experience in DevOps, SRE, or Infrastructure Engineering roles.
- Deep hands-on Kafka experience, including deploying, maintaining, scaling, and monitoring clusters.
- Experience with RabbitMQ.
- Extensive experience with Docker, Kubernetes, Helm, and GitOps-style deployments.
- Infrastructure as Code experience (Terraform, Pulumi, etc.).
- Strong skills in scripting and automation (Python, Bash, etc.).
- Familiarity with Confluent Cloud, Confluent for Kubernetes, and similar tools.
- Solid understanding of authentication and authorization mechanisms in distributed systems.
- Production support mindset, with a proven history of troubleshooting and incident resolution.
- Collaboration and communication skills, especially with dev teams that depend on platform support.
- Experience with Istio Service Mesh (bonus).
- Experience with GovCloud (bonus).
Bonus Qualities:
- Mentorship and leadership experience in infrastructure or SRE teams.
- Contributions to automation or monitoring open-source tooling.
- Active participant in SRE or DevOps communities.
- Conference speaker or internal tech trainer.
- Technical writing about infrastructure automation or reliability.