Building a Data Factory: A Generic ETL Pipeline Utility Case Study

FactSet, a leading provider of content in financial services, is focused on continuously improving our data pipelines and data fetch APIs. Most pipeline frameworks like Flink and Spark require writing code against their APIs to define a pipeline, and to cover the breadth of content we offer, various departments have had to write custom ETL code to add value at different stages of the content enrichment process. To standardize and simplify this common workflow, we decided to create a configuration file-based utility that still gives us the granular control we need, but encapsulates data movements and flows in centralized config files, reducing or eliminating the disparate custom ETL scripts. This case study will examine why we chose Golang and Apache Arrow to mix new data with our legacy sources and existing stack as we modernize our fetch code paths, and will discuss the other technologies we leveraged along the way.
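The abstract itself contains no code, but as a purely illustrative sketch of the configuration-driven, interface-based approach it describes, the following Go snippet passes an Apache Arrow record batch through pipeline stages built from config entries. The StageConfig and Stage names, the passthrough stage, and the config shape are hypothetical, not FactSet's actual utility.

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v12/arrow"
	"github.com/apache/arrow/go/v12/arrow/array"
	"github.com/apache/arrow/go/v12/arrow/memory"
)

// StageConfig is a hypothetical config entry; in practice it would be
// unmarshalled from a centralized YAML/JSON pipeline definition file.
type StageConfig struct {
	Name   string
	Params map[string]string
}

// Stage is the common interface every pipeline step implements, so new
// transforms can be added without per-department custom ETL scripts.
type Stage interface {
	Process(rec arrow.Record) (arrow.Record, error)
}

// passthrough is a trivial stage used here only to show the wiring.
type passthrough struct{ name string }

func (p passthrough) Process(rec arrow.Record) (arrow.Record, error) {
	rec.Retain() // keep the batch alive for the next stage
	fmt.Printf("stage %s saw %d rows\n", p.name, rec.NumRows())
	return rec, nil
}

// buildPipeline turns config entries into concrete stages; a real utility
// would look up stage constructors in a registry keyed by the config name.
func buildPipeline(cfgs []StageConfig) []Stage {
	stages := make([]Stage, 0, len(cfgs))
	for _, c := range cfgs {
		stages = append(stages, passthrough{name: c.Name})
	}
	return stages
}

func main() {
	// Build a small Arrow record batch as the in-memory interchange format.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "ticker", Type: arrow.BinaryTypes.String},
		{Name: "price", Type: arrow.PrimitiveTypes.Float64},
	}, nil)

	bldr := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	defer bldr.Release()
	bldr.Field(0).(*array.StringBuilder).AppendValues([]string{"FDS", "AAPL"}, nil)
	bldr.Field(1).(*array.Float64Builder).AppendValues([]float64{401.25, 172.10}, nil)

	rec := bldr.NewRecord()
	defer rec.Release()

	// Drive the batch through the stages defined by the config.
	pipeline := buildPipeline([]StageConfig{{Name: "extract"}, {Name: "enrich"}, {Name: "load"}})
	cur := rec
	cur.Retain()
	for _, s := range pipeline {
		next, err := s.Process(cur)
		cur.Release()
		if err != nil {
			panic(err)
		}
		cur = next
	}
	cur.Release()
}
```

The point of the sketch is the design choice: with arrow.Record as the interchange type between stages, a single generic utility can serve many content sets, since each stage only needs to agree on the Arrow schema rather than on department-specific ETL code.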

Topics Covered

Apache Arrow Flight
Dremio Subsurface for Apache Arrow
In-Memory Formats
Interfaces

Speakers

William Whispell Dremio Author & Contributor

I enjoy working on data storage and retrieval. I’ve worked on various open source databases and our internal time series database. A lot of my work has been around platform migrations and breaking up monoliths into SOA. I enjoy taking on big data and low latency problems. Data flows, data pipelines, structured streaming and CDC are some of my areas of interest. I also enjoy finding creative ways to automate repetitive data operations engineers face on a regular basis.

Matt Topol Dremio Author & Contributor

Hailing from the faraway land of Brentwood, NY, and currently residing in the rolling hills of Connecticut, Matt Topol has always been passionate about software. After graduating from Brooklyn Polytechnic (now NYU-Poly), he joined FactSet Research Systems, Inc. in 2009 to develop financial software. In the time since, Matt has worked in infrastructure and application development, has led development teams, and has architected large-scale distributed systems for processing analytics on financial data. Matt is a committer on the Apache Arrow repository, frequently enhancing the Golang library and helping to grow the Arrow community. Recently, Matt wrote the first and only book on Apache Arrow, “In-Memory Analytics with Apache Arrow,” and joined Voltron Data in order to work on the Apache Arrow libraries full-time and grow the Arrow Golang community.

In his spare time, Matt likes to bash his head against a keyboard, develop/run delightfully demented games of fantasy for his victims–er–friends, and share his knowledge with anyone interested who’ll listen to his rants.
