What You'll Do
- You will collaborate with engineering teams to improve, maintain, and performance-tune Vimeo’s data platforms and infrastructure, and to plan their capacity.
- You will design business continuity and disaster recovery plans and processes, and work with the engineering team to implement them.
- You will drive the incident management process for our data platform, working with our partner teams to perform incident post-mortems, root cause analysis, and prevent recurring incidents.
- You will lead the standard change and release management process, automate it, promote related best practices across engineering teams, and help Vimeo meet and maintain legal compliance.
- Build intelligent monitoring over data pipelines and infrastructure to achieve early and automated anomaly detection.
- You'll work closely with software developers to build an end-to-end automated testing framework and system-level testing environment.
- Participate in an on-call rotation.
Skills and knowledge you should possess:
- You have production experience with distributed data stores, e.g. HBase, ZooKeeper, Kafka
- Own, manage, monitor, and optimize the reliability and overall health of our development and production environments
- Detailed problem-solving approach, coupled with a strong sense of ownership and drive
- A strong bias to action and a passion for delivering high-quality data solutions
- 3+ years of experience working in Linux environments, and proficiency with cloud environments (AWS, GCP)
- Experience with container orchestration platforms, particularly Kubernetes, for managing and deploying data processing and analysis applications.
- Experience coding in one or more of the following programming languages: Python, Java (mandatory), or Scala
- 1+ years of hands-on experience in Reliability Engineering for high-performance, scalable, distributed data systems, with a focus on automation
- Experience with configuration management and provisioning tools such as Chef, Puppet, Ansible, or Terraform
- Deep understanding of CI/CD principles and familiarity with source control systems (Git)
- Work with peer SREs to roll out changes to our production environment and help mitigate data-related production incidents.
- Experience with a Change Data Capture system, such as Debezium, is a plus.
- Attention to detail and quality with excellent problem-solving and interpersonal skills
- Bonus: some experience in data warehousing and data engineering
Base Salary Range:
- NYC Metro, Bay Area, Seattle, & Los Angeles: $118,000 - $162,500
- All other US cities outside above metro areas: $106,200 - $146,250
We also offer paid time off, a generous 401k match, commuter benefits, a Health Savings Account (HSA), a Flexible Spending Account (FSA), fertility reimbursement, group term life insurance, wellbeing resources, and more.