Data Engineer III
You’ll work closely with different data engineering teams on their incident management process, post-mortems, root cause analysis, and prevention of recurring incidents.
What You'll Do
- You will collaborate with engineering teams to improve, maintain, performance-tune, and capacity-plan Vimeo’s data platforms and infrastructure.
- Design business continuity and disaster recovery plans and processes, and work with the engineering team to implement them.
- You will drive the incident management process for our data platform, working with our partner teams to perform incident post-mortems, root cause analysis, and prevent recurring incidents.
- You will lead the standard change and release management process, automate and promote related best practices across engineering teams, and help Vimeo meet and maintain legal compliance.
- Build intelligent monitoring over data pipelines and infrastructure to achieve early and automated anomaly detection.
- You'll work closely with software developers to build an end-to-end automated testing framework and system-level testing environment.
- Participate in an on-call rotation.
What To Bring
- You have production experience with distributed data stores, e.g. HBase, ZooKeeper, Kafka
- Own, manage, monitor, and optimize the reliability and overall health of our development and production environments
- Detailed problem-solving approach, coupled with a strong sense of ownership and drive
- A strong bias to action and a passion for delivering high-quality data solutions
- 2+ years of experience working in a Linux environment, and proficiency with cloud environments (AWS, GCP)
- Experience with container orchestration platforms, particularly Kubernetes, for managing and deploying data processing and analysis applications.
- Experience coding in one or more of the following programming languages: Java (mandatory), Python, or Scala
- 2+ years of hands-on experience in Reliability Engineering for high-performance, scalable, and distributed data systems with a focus on automation
- Experience with configuration management and infrastructure-as-code tools such as Chef, Puppet, Ansible, or Terraform.
- Deep understanding of CI/CD principles and familiarity with source control systems (Git)
- Work with peer SREs to roll out changes to our production environment and help mitigate data-related production incidents.
- Experience with a Change Data Capture system, such as Debezium, is a plus.
- Attention to detail and quality with excellent problem-solving and interpersonal skills
- Bonus: some experience in data warehousing and data engineering