How we store and manage data has completely changed over the last decade. We moved from an ETL world to an ELT world, with companies like Fivetran pushing the trend. However, we don’t think it is going to stop there; in our mind, ELT is a transition towards EL+T (with EL decoupled from T). To understand this, we need to discern the underlying reasons for this trend, as they might show what’s in store for the future. This is what we will be doing in this article.

I’m the co-founder of Airbyte, the new upcoming open-source standard for data integrations.

What are the problems with ETL?

Historically, the data pipeline process consisted of extracting, transforming, and loading data into a warehouse or a data lake. There are serious disadvantages to this sequence.

Inflexibility

ETL is inherently rigid. It forces data analysts to know beforehand every way they are going to use the data and every report they are going to produce. Any change they make can be costly, and it can affect data consumers downstream of the initial extraction.

Lack of visibility

Every transformation performed on the data obscures some of the underlying information. Analysts won’t see all the data in the warehouse, only what was kept during the transformation phase. This is risky, as conclusions might be drawn based on data that hasn’t been properly sliced.

Lack of Autonomy for Analysts

Last but not least, building an ETL-based data pipeline is often beyond the technical capabilities of analysts. It typically requires the close involvement of engineering talent, along with additional code to extract and transform each source of data. The alternative to a complex engineering project is to conduct analyses and build reports on an ad hoc, time-intensive, and ultimately unsustainable basis.

What changed and why ELT is way better

Cloud-based Computation and Storage of Data

The ETL approach was once necessary because of the high costs of on-premises computation and storage.
With the rapid growth of cloud-based data warehouses such as Snowflake, and the plummeting cost of cloud-based computation and storage, there is little reason to continue doing transformation before loading at the final destination. Indeed, flipping the two enables analysts to do a better job in an autonomous way.

ELT Supports Agile Decision-Making for Analysts

When analysts can load data before transforming it, they don’t have to determine beforehand exactly what insights they want to generate, or decide on the exact schema they need. Instead, the underlying source data is directly replicated to a data warehouse, comprising a “single source of truth.” Analysts can then perform transformations on the data as needed. They will always be able to go back to the original data and won’t suffer from transformations that might have compromised the integrity of the data, giving them a free hand. This makes the business intelligence process incomparably more flexible and safe.

ELT Promotes Data Literacy Across the Whole Company

When used in combination with cloud-based business intelligence tools such as Looker, Mode, and Tableau, the ELT approach also broadens access to a common set of analytics across organizations. Business intelligence dashboards become accessible even to relatively non-technical users.

We’re big fans of ELT at Airbyte, too. But ELT is not completely solving the data integration problem and has problems of its own. We think EL needs to be completely decoupled from T.

What’s changing now and why EL+T is the future

Merging of Data Lakes and Warehouses

There was a great analysis by Andreessen Horowitz about how data infrastructures are evolving. Here is the architecture diagram of the modern data infrastructure they came up with after a lot of interviews with industry leaders.

Data infrastructure serves two purposes at a high level:

Analytic use cases: it helps business leaders make better decisions through the use of data.

Operational use cases: it builds data intelligence into customer-facing applications, including via machine learning.

Two parallel ecosystems have grown up around these broad use cases. The data warehouse forms the foundation of the analytics ecosystem. Most warehouses store data in a structured format. They are designed to generate insights from core business metrics, usually with SQL (although Python is growing in popularity). The data lake is the backbone of the operational ecosystem. By storing data in raw form, it delivers the flexibility, scale, and performance required for applications and more advanced data processing needs. Data lakes operate on a wide range of languages including Java/Scala, Python, R, and SQL.

What’s really interesting is that modern data warehouses and data lakes are starting to resemble one another, both offering commodity storage, native horizontal scaling, semi-structured data types, ACID transactions, interactive SQL queries, and so on.

So you might be wondering if data warehouses and data lakes are on a path toward convergence. Will they become interchangeable in a stack? Will data warehouses also be used for the operational use case?

EL+T Supports Both Use Cases: Analytics and Operational ML

EL, in contrast to ELT, completely decouples the Extract-Load part from any optional transformation that may occur.

The operational use cases are all unique in the way incoming data is leveraged. Some might use a unique transformation process; some might not use any transformation at all.

As for the analytics case, analysts will need to get the incoming data normalized for their own needs at some point. But decoupling EL from T would let them choose whichever normalization tool they want.
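To make the decoupling concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database standing in for the warehouse; the function names, the `raw_users` and `users` tables, and the sample records are all hypothetical illustrations, not Airbyte's implementation. The point is only that the extract-load step stores records untouched, while the transformation is a separate, optional step that reads from the raw copy.

```python
import json
import sqlite3


def extract():
    """Hypothetical source: returns raw records as an API might deliver them."""
    return [
        {"id": 1, "email": "Ada@Example.com", "plan": "pro"},
        {"id": 2, "email": "bob@example.com", "plan": None},
    ]


def load_raw(conn, records):
    """EL step: store every record unchanged, serialized as JSON text."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_users (payload TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO raw_users (payload) VALUES (?)",
        [(json.dumps(record),) for record in records],
    )
    conn.commit()


def transform_users(conn):
    """Optional T step: build a normalized table from the raw copy.

    It only reads raw_users, so it can be rerun, changed, or replaced
    by another tool without touching the extract-load step.
    """
    conn.execute("DROP TABLE IF EXISTS users")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, plan TEXT)"
    )
    rows = conn.execute("SELECT payload FROM raw_users ORDER BY rowid").fetchall()
    for (payload,) in rows:
        record = json.loads(payload)
        conn.execute(
            "INSERT INTO users (id, email, plan) VALUES (?, ?, ?)",
            (record["id"], record["email"].lower(), record["plan"] or "free"),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_raw(conn, extract())   # an operational consumer could stop here
transform_users(conn)       # an analytics consumer adds its own normalization
```

Because the raw table is never modified, an operational use case can read `raw_users` directly, while analysts can change or rerun `transform_users` at any time against the original data.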
DBT has been gaining a lot of traction lately among data engineering and data science teams. It has become the open-source standard for transformation. Even Fivetran integrates with it, to let teams keep using DBT if they’re used to it.

EL Scales Faster and Leverages the Whole Ecosystem

Transformation is where all the edge cases lie. For every specific need within any company, there is a schema normalization unique to it, for each and every one of the tools.

Decoupling EL from T enables the industry to start covering the long tail of connectors. At Airbyte, we’re building a “connector manufacturing plant” so we can get to 1,000 pre-built connectors in a matter of months.

Furthermore, as mentioned above, it would help teams leverage the whole ecosystem more easily. You start to see an open-source standard for every need. In a sense, the future data architecture might look like this:

In the end, extract and load will be decoupled from transformation. Do you agree with us? If so, you might be interested to have a look at what Airbyte does.

Previously published at https://airbyte.io/articles/data-engineering-thoughts/why-the-future-of-etl-is-not-elt-but-el/
