AI Shouldn’t Have to Waste Time Reinventing ETL

The recent progress in AI is very exciting. People are using it in all sorts of novel ways, from improving customer support experiences and writing and running code, to making new music and even accelerating medical imaging technology.

But in the process, a worrying trend has emerged: the AI community seems to be reinventing data movement (aka ETL). Whether they call them connectors, extractors, integrations, document loaders, or something else, people are writing the same code to extract data out of the same APIs, document formats, and databases and then load it into vector DBs or indices for their LLMs.

The problem is that building & maintaining robust extraction and loading pipelines from scratch is a huge commitment. And there’s so much prior art in that area that for almost all engineers or companies in the AI space, it’s a huge waste of time to rebuild it. In a space where breaking news emerges approximately every hour, the main focus should be on making your core product incredible for your users, not going on sidequests. And for almost everyone, the core product is not data movement; it’s the AI-powered magic sauce you’re brewing.

A lot has been written (1, 2) about the challenges involved in building robust Extract, Transform, and Load (ETL) pipelines, but let’s contextualize it within AI.

Why does AI need data movement?

LLMs trained on public data are great, but you know what’s even better? AIs that can answer questions specific to us, our companies, and our users. We’d all love it if ChatGPT could learn our entire company wiki, read all of our emails, Slack messages, meeting notes and transcripts, plug into our company’s analytics environment, and use all of these sources when answering our questions. Or when integrating AI into our own product (for example with Notion AI), we'd want our app’s AI model to know all the information we have about a user when helping them.

Data movement is a prerequisite for all that.

Whether you’re fine-tuning a model or using Retrieval-Augmented Generation (RAG), you need to extract data from wherever it lives, transform it into a format digestible by your model, then load it into a datastore your AI app can access to serve your use case.

The diagram above illustrates what this looks like when using RAG, but you can imagine that even if you’re not using RAG, the basic steps are unlikely to change: you need to Extract, Transform, and Load (aka ETL) the data in order to build AI models that know non-public information specific to you and your use case.
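
To make the three steps concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the source is a plain dict standing in for a wiki or API, the transform is naive word-based chunking, and the "datastore" is a list rather than a vector DB. A real RAG pipeline would also compute embeddings before loading.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    position: int
    text: str


def extract(source: dict[str, str]):
    """Extract: yield (doc_id, raw_text) pairs from a stand-in source."""
    yield from source.items()


def transform(doc_id: str, text: str, max_words: int = 50) -> list[Chunk]:
    """Transform: split raw text into fixed-size word chunks."""
    words = text.split()
    return [
        Chunk(doc_id, i // max_words, " ".join(words[i:i + max_words]))
        for i in range(0, len(words), max_words)
    ]


def load(chunks: list[Chunk], store: list[Chunk]) -> None:
    """Load: append chunks to a stand-in datastore (hypothetical)."""
    store.extend(chunks)


def run_pipeline(source: dict[str, str], store: list[Chunk], max_words: int = 50) -> int:
    """Run extract -> transform -> load and return the store size."""
    for doc_id, text in extract(source):
        load(transform(doc_id, text, max_words), store)
    return len(store)
```

The hard parts discussed in the rest of this article all live around these three functions, not inside them.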

Why is data movement hard?

Building a basic functional MVP for data extraction from an API or database is usually – though not always – doable on short notice (under a week). The really hard part is making it production-ready and keeping it that way. Let’s look at some of the standard challenges that come to mind when building extraction pipelines.

Incremental Extracts and state management

If you have any meaningful data volume, you’ll need to implement incremental extraction such that your pipeline only extracts the data it hasn’t seen before. To do this, you’ll need to have a persistence layer to keep track of what data each connection extracted.
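
As a rough sketch of what that persistence layer might look like, the example below stores one cursor per connection in SQLite. The table name, the `updated_at` cursor field, and the function names are all assumptions made for illustration. It also assumes cursor values are ISO 8601 UTC timestamps in a single format, so that string comparison matches time order.

```python
import sqlite3


def init_state(db: sqlite3.Connection) -> None:
    """Create the table that remembers how far each connection has read."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS cursors "
        "(connection_id TEXT PRIMARY KEY, cursor_value TEXT)"
    )


def get_cursor(db: sqlite3.Connection, connection_id: str):
    row = db.execute(
        "SELECT cursor_value FROM cursors WHERE connection_id = ?",
        (connection_id,),
    ).fetchone()
    return row[0] if row else None


def set_cursor(db: sqlite3.Connection, connection_id: str, value: str) -> None:
    db.execute(
        "INSERT OR REPLACE INTO cursors (connection_id, cursor_value) VALUES (?, ?)",
        (connection_id, value),
    )
    db.commit()


def extract_incremental(db, connection_id, records, cursor_field="updated_at"):
    """Return only records newer than the saved cursor, then advance it."""
    cursor = get_cursor(db, connection_id)
    new = [r for r in records if cursor is None or r[cursor_field] > cursor]
    if new:
        set_cursor(db, connection_id, max(r[cursor_field] for r in new))
    return new
```

Note that the strict `>` comparison can skip records that share a timestamp with the saved cursor. Production connectors usually handle that edge case explicitly, for example by re-reading a small overlap window and deduplicating.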

Transient error handling, backoffs, resume-on-failure(s), air gapping

Upstream data sources fail all the time, sometimes without any clear reason. Your pipelines need to be resilient to this and retry with the right backoff policies. If the failures are not-so-transient (but still not your fault), then your pipeline needs to be resilient enough to remember where it left off and resume from the same place once upstream is fixed. And sometimes, the problem coming from upstream is severe enough (like an API dropping some crucial fields from records) that you want to pause the whole pipeline altogether until you examine what’s happening and manually decide what to do.
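
The retry part alone might look like the sketch below, which uses exponential backoff with full jitter. The `TransientError` class and the `with_retries` helper are hypothetical names. How you tell transient failures apart from permanent ones depends on the API you are calling. Resuming from saved state and pausing the pipeline are separate concerns not shown here.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx, ...)."""


def with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the ceiling so that
            # many workers do not all retry at the same instant.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so the behavior can be tested without actually waiting.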

Identifying & proactively fixing configuration errors

If you’re building data extraction pipelines to extract your customers’ data, you’ll need to implement some defensive checks to ensure that all the configuration your customers gave you to extract data on their behalf is correct and, if it’s not, quickly give them actionable error messages. Most APIs do not make this easy because they don’t publish comprehensive error tables, and even when they do, they rarely give you endpoints that you can use to check the permissions assigned to e.g. API tokens, so you have to find ways to balance comprehensive checks with quick feedback for the user.
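
One possible shape for such a check is sketched below. The field names (`api_token`, `workspace_id`) and the `probe` callable are assumptions: `probe` stands in for one cheap authenticated request and returns its HTTP status code. The hints follow the standard meaning of 401, 403, and 404, but real APIs often deviate from it.

```python
def check_config(config: dict, probe) -> list[str]:
    """Return a list of actionable error messages; empty means the config looks usable."""
    errors = []
    # Cheap local checks first: no network call needed to spot missing fields.
    for field in ("api_token", "workspace_id"):
        if not config.get(field):
            errors.append(f"Missing required field '{field}'.")
    if errors:
        return errors

    # One lightweight request to verify the credentials actually work.
    status = probe(config)
    hints = {
        401: "The API token was rejected. Check that it is valid and not expired.",
        403: "The API token lacks permission to read this workspace.",
        404: "The workspace was not found. Check the workspace ID.",
    }
    if status in hints:
        errors.append(hints[status])
    elif status >= 400:
        errors.append(f"Unexpected response (HTTP {status}) while checking the connection.")
    return errors
```

A single probe keeps feedback fast, at the cost of missing permission problems on endpoints the probe does not touch.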

Authentication & secret management

APIs range in simplicity from simple bearer token auth to, uh, “creative” implementations of session tokens or single-use-token OAuth. You’ll need to implement the logic to perform the auth as well as manage the secrets which may be getting refreshed once an hour, potentially coordinating secret refreshes across multiple concurrent workers.
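
For workers that are threads in a single process, coordinating refreshes can be as simple as the sketch below: a lock-protected cache that refreshes the token shortly before it expires. `TokenCache` and the `fetch_token` callable (assumed to return a token and its lifetime in seconds) are hypothetical. Workers spread across processes or machines would need a shared store and a distributed lock instead.

```python
import threading
import time


class TokenCache:
    """Share one access token across threads, refreshing it before expiry."""

    def __init__(self, fetch_token, refresh_margin=60.0, clock=time.monotonic):
        self._fetch_token = fetch_token  # returns (token, lifetime_seconds)
        self._margin = refresh_margin    # refresh this many seconds early
        self._clock = clock
        self._lock = threading.Lock()
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        # The lock ensures only one worker refreshes; the rest wait and
        # then reuse the fresh token instead of each requesting their own.
        with self._lock:
            if self._token is None or self._clock() >= self._expires_at - self._margin:
                token, lifetime = self._fetch_token()
                self._token = token
                self._expires_at = self._clock() + lifetime
            return self._token
```

This matters most with single-use refresh tokens, where two workers refreshing at once can invalidate each other's credentials.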

Optimizing extract & load speeds, concurrency, and rate limits

And speaking of concurrent workers, you’ll likely want to implement concurrency to achieve a high throughput for your extractions. While this may not matter on small datasets, it’s absolutely crucial on larger ones. Even though APIs publish official rate limits, you’ll need to empirically figure out the best parallelism parameters for maxing out the rate limit provided to you by the API without getting IP blacklisted or forever-rate-limited.
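
A minimal sketch of combining the two, assuming a thread pool and a client-side limiter that spaces out request start times. `RateLimiter`, `fetch_all`, and `fetch_page` are hypothetical names, and the `rate` and `max_workers` values are placeholders that you would tune empirically against the API in question.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most `rate` calls per second by spacing out start times."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def acquire(self):
        # Reserve the next free time slot under the lock, then wait
        # outside the lock so other workers can reserve theirs.
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        wait = slot - now
        if wait > 0:
            self._sleep(wait)


def fetch_all(fetch_page, page_ids, max_workers=4, rate=5.0):
    """Fetch pages concurrently while staying under `rate` requests per second."""
    limiter = RateLimiter(rate)

    def task(page_id):
        limiter.acquire()
        return fetch_page(page_id)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, page_ids))  # results keep input order
```

A limiter like this only enforces the rate you configure. Finding the rate an API tolerates in practice is still trial and error.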

Adapting to upstream API changes

APIs change and take on new undocumented behaviors or quirks all the time. Many vendors publish new API versions quarterly. You’ll need to keep an eye on how all these updates may impact your work and devote engineering time to keep it all up to date. New endpoints come up all the time, and some change their behavior (and you don’t always get a heads up).

Scheduling, monitoring, logging, and observability

Beyond the code which extracts data from specific APIs, you’ll also likely need to build some horizontal capabilities leveraged by all of your data extractors. You’ll want some scheduling as well as logging and monitoring for when the scheduling doesn’t work, or when other things go wrong and you need to go investigate. You also likely want some observability, such as how many records were extracted yesterday, today, last week, etc., and which API endpoints or database tables they came from.

Data blocking or hashing

Depending on where you’re pulling data from, you may need some privacy features for either blocking or hashing columns before sending them downstream.
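
A minimal sketch of both options follows. The choice of keyed HMAC-SHA-256 over a plain hash is my assumption, not something prescribed here: a key makes it harder to reverse hashes of low-entropy values such as email addresses by brute force. The function name, the field names, and the placeholder key are illustrative only.

```python
import hashlib
import hmac


def protect_record(record, blocked=(), hashed=(), key=b"replace-with-a-secret-key"):
    """Drop blocked fields and replace hashed fields with an HMAC-SHA-256 digest."""
    out = {}
    for field, value in record.items():
        if field in blocked:
            continue  # blocked columns never leave the pipeline
        if field in hashed and value is not None:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()
        else:
            out[field] = value
    return out
```

Because the digest is deterministic for a given key, hashed columns can still be used for joins and deduplication downstream.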

To be clear, the above does not apply if you just want to move a few files as a one-time thing.

But it does apply when you’re building products that require data movement. Sooner or later, you’ll need to deal with most of these concerns. And while no single one of them is insurmountable rocket science, taken together they can quickly add up to one or more full-time jobs, and the more data sources you pull from, the bigger the load.

And that’s exactly the difficulty with maintaining data extraction & loading pipelines: the majority of their cost comes from the continuous incremental investment needed to keep those pipelines functional & robust. For most AI engineers, that’s just not the job that adds the most value to their users. Their time is best spent elsewhere.

So what’s an AI engineer gotta do to move some data around here?

If you ever find yourself in need of data extraction and loading pipelines, try the solutions already available instead of automatically building your own. Chances are they can solve a lot if not all of your concerns. If not, build your own as a last resort.

And even if existing platforms don’t support everything you need, you should still be able to get most of the way there with a portable and extensible framework. This way, instead of building everything from scratch, you can get 90% of the way there with off-the-shelf features in the platform and only build and maintain the last 10%. The most common example is long-tail integrations: if the platform doesn’t ship with an integration to an API you need, then a good platform will make it easy to build that integration with some code or even a no-code tool, and still get all the useful features offered by the platform. Even if you want the flexibility of just importing a connector as a Python package and triggering it however you like from your code, you can use one of the many open-source EL tools like Airbyte or Singer connectors.

To be clear, data movement is not completely solved. There are situations where existing solutions genuinely fall short and you need to build novel solutions. But this is not a majority of the AI engineering population. Most people don’t need to rebuild the same integrations with Jira, Confluence, Slack, Notion, Gmail, Salesforce, etc… over and over again. Let’s just use the solutions that have already been battle-tested and made open for anyone to use so we can get on with adding the value our users actually care about.
