Your impact
Core Responsibilities
- Building and maintaining the schedules that keep pipelines running.
- Setting up and maintaining health checks on different pipelines.
- Responding to, triaging, and debugging pipeline problems (usually when health checks fail). This includes limited support outside typical hours for critical issues (e.g. a critical subset of alerts can page you overnight or during the weekend*).
- Reading code, writing code changes, and modifying the monitoring setup where necessary.
- Knowing and understanding how to navigate the pipelines and documentation.
- Following SOPs to contact other teams and data providers when data is incorrect or not received on time.
- Communicating outages to the end users of a pipeline.
- Contributing to and monitoring tooling improvements (where feasible).
* Active work is not required on weekends or outside typical office hours. However, due to the nature of on-call, you must be available to respond if there is a critical outage during your assigned on-call weeks. After an on-call weekend, you will receive 2 days off.
Here's what you'll need
- Comfortable reading and writing code in SQL, Python, PySpark, and Java.
- Basic understanding of Spark, and familiarity with or interest in learning the basics of tuning Spark jobs.
- Practical experience performing root cause analysis and documenting lessons learned from production incidents (e.g. creating post-mortem reports).
- Ability to work within an agile team.
- Strong written and verbal communication skills with the ability to skillfully engage with customers on complex, sensitive topics.
- Strong organizational skills, attention to detail, and effective prioritization.
What We Require
- Top Secret clearance or higher.
- Located within commuting distance of our NYC office.