Build ETL pipelines using technologies such as Python and Spark (see the brief illustrative sketch after this list)
Implement new ETL pipelines on top of a variety of architectures (e.g. file-based, streaming)
Determine the best strategies for building AI tools, including how to chunk and retrieve data for RAG and which LLMs are most appropriate for each use case
Stay abreast of industry trends in the AI space, and evaluate and incorporate new concepts/tools into MongoDB’s internal AI architecture
Make architectural decisions relating to storing large datasets using a variety of file formats (e.g. Parquet, JSON) and table types (e.g. Iceberg, Hive)
Work with the Security and Compliance teams to ensure that datasets have appropriate permissions and meet relevant regulatory requirements
Work with Data Analysts and Data Scientists to understand the data that matters for their analyses and make it available to them
Work with our Data Platform, Architecture, and Governance sibling teams to make data scalable, consumable, and discoverable
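For illustration only, here is a minimal PySpark sketch of the kind of ETL work described above. The paths, column names, and schema are hypothetical placeholders, not MongoDB's actual pipeline code.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local SparkSession for the sketch; a real pipeline would run on a managed cluster.
spark = SparkSession.builder.appName("example-etl").getOrCreate()

# Extract: read raw JSON events (placeholder path).
raw = spark.read.json("s3://example-bucket/raw/events/")

# Transform: light cleanup plus a derived partition column
# (event_id and event_ts are assumed example fields).
cleaned = (
    raw.dropDuplicates(["event_id"])
       .withColumn("event_date", F.to_date("event_ts"))
)

# Load: write columnar Parquet partitioned by date (placeholder output path).
(
    cleaned.write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/curated/events/")
)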
We’re looking for someone with:
5+ years of experience building ETL pipelines for a Data Lake/Warehouse
1+ year of experience building AI and RAG-based applications
5+ years of Python experience
5+ years of Spark experience
Experience with Hive, Iceberg, Glue, or other technologies that expose big data as tables
Familiarity with different big data file types such as Parquet, Avro, and JSON
Success Measures:
In 3 months, you'll have a thorough understanding of the architecture of MongoDB’s internal Data Lake and AI ecosystem
In 6 months, you'll have owned a large project end to end, from scoping and design through delivery
In 12 months, you'll have designed new features, led development work, and become a go-to expert on parts of the system