When talking about data processing, people often abbreviate it as “ETL.” Looked at more closely, however, data processing has gone through several iterations, from ETL to ELT to variants such as Reverse ETL and Zero-ETL, and now to the increasingly popular EtLT architecture. The Hadoop era relied mainly on ELT (Extract, Load, Transform), but the rise of real-time data warehouses and data lakes has rendered ELT obsolete, and EtLT has emerged as the standard architecture for loading data into data lakes and real-time data warehouses in real time.
Let’s explore the reasons behind the emergence of these architectures, their strengths and weaknesses, and why EtLT is gradually replacing ETL and ELT as the global mainstream data processing architecture, along with practical open-source methods.
In the early days of data warehousing, Bill Inmon, its original proponent, defined the data warehouse as a subject-oriented data storage architecture in which data is categorized and cleaned as it is stored. During this period, most data sources were structured databases and business systems (e.g., MySQL, Oracle, SQL Server, ERP, CRM), while data warehouses predominantly relied on OLTP databases (e.g., DB2, Oracle) for querying and historical storage. Handling complex ETL processes in such databases proved challenging, so a wave of ETL software emerged, including Informatica, Talend, and Kettle, which greatly eased the integration of complex data sources and offloaded work from the data warehouse.
During the early and mid stages of data warehousing, when the complexity of data sources far exceeded what warehouse databases could process directly, the ETL architecture became the industry standard and remained popular for over two decades.
As data volumes grew, the hardware costs of both data warehouses and ETL escalated. New MPP (Massively Parallel Processing) and distributed technologies emerged, driving a gradual shift from ETL to ELT in the later stages of data warehousing and the early big data era. Teradata, one of the major data warehouse vendors, and Hive, the popular Hadoop-based data warehousing solution, both adopted the ELT architecture: data is loaded directly (without complex transformations such as joins or group-bys) into the warehouse’s staging layer and then processed with SQL or HQL from the staging layer into the atomic data layer, and finally into the summary and metrics layers. Although Teradata targeted structured data and Hadoop targeted unstructured data, both popularized a similar three-to-four-layer storage architecture and methodology for data warehousing worldwide.
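To make the ELT pattern concrete, here is a minimal, illustrative sketch in Python. It uses sqlite3 purely as a stand-in for an MPP warehouse such as Teradata or Hive, and all table and column names are assumptions for the example rather than anything from a real system: raw data is loaded untouched into a staging table, and every transformation happens afterwards inside the “warehouse” as SQL.

```python
# ELT in miniature: extract and load raw data first, transform inside the
# warehouse afterwards. sqlite3 is only a stand-in for an MPP warehouse such
# as Teradata or Hive; table and column names are illustrative.
import sqlite3

raw_orders = [                       # extracted as-is from a source system
    ("1001", "alice", "2023-07-01", "129.90"),
    ("1002", "bob",   "2023-07-01", "59.00"),
    ("1003", "alice", "2023-07-02", "20.50"),
]

warehouse = sqlite3.connect(":memory:")

# E + L: land the data in a staging layer with no joins or aggregation.
warehouse.execute(
    "CREATE TABLE stg_orders (order_id TEXT, customer TEXT, order_date TEXT, amount TEXT)"
)
warehouse.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", raw_orders)

# T: all transformation happens in-warehouse as SQL,
# staging -> atomic layer -> summary layer.
warehouse.execute("""
    CREATE TABLE dwd_orders AS
    SELECT order_id, customer, DATE(order_date) AS order_date,
           CAST(amount AS REAL) AS amount
    FROM stg_orders
""")
warehouse.execute("""
    CREATE TABLE dws_daily_sales AS
    SELECT order_date, SUM(amount) AS total_amount, COUNT(*) AS order_cnt
    FROM dwd_orders
    GROUP BY order_date
""")

print(warehouse.execute("SELECT * FROM dws_daily_sales").fetchall())
```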
ODS (Operational Data Store) was introduced as a transitional solution, both to handle complex data sources that ELT-based data warehouses could not load and to improve real-time capabilities. With ODS, complex sources are processed through real-time CDC (Change Data Capture), real-time APIs, or micro-batch jobs into a separate storage layer before being ELT-ed into the enterprise data warehouse. Many enterprises still take this approach today. Some place the ODS inside the data warehouse itself, using Spark or MapReduce for the initial ETL and then performing business-level processing within the warehouse (with tools such as Hive, Teradata, Oracle, and DB2).
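Below is a minimal sketch of how an ODS layer might be fed by micro-batch processing: poll the source for rows changed since a watermark and land them in a separate store before the downstream ELT runs. The watermark column, table names, and the use of sqlite3 as a stand-in for both the operational database and the ODS are assumptions for illustration only.

```python
# Micro-batch feed into an ODS layer: poll the source for rows changed since
# the last watermark and land them in a separate store before downstream ELT.
# sqlite3 stands in for both the operational database and the ODS; the
# updated_at watermark column and table names are illustrative.
import sqlite3

source = sqlite3.connect(":memory:")   # stand-in for an operational database
ods = sqlite3.connect(":memory:")      # stand-in for the ODS storage layer

source.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
ods.execute("CREATE TABLE ods_customers (id INTEGER, name TEXT, updated_at TEXT)")

watermark = "1970-01-01 00:00:00"      # time of the last change already loaded

def micro_batch():
    """Copy every row modified after the watermark into the ODS."""
    global watermark
    rows = source.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    if rows:
        ods.executemany("INSERT INTO ods_customers VALUES (?, ?, ?)", rows)
        watermark = rows[-1][2]        # advance the watermark
    return len(rows)

# Simulate a change arriving in the source, then run one micro-batch cycle.
source.execute("INSERT INTO customers VALUES (1, 'alice', '2023-07-01 10:00:00')")
print(micro_batch(), "row(s) landed in the ODS")
```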
At this stage, an early form of EtLT (Extract, lightweight transform, Load, Transform) had already taken shape. It is characterized by a division of roles: the complex work of data extraction, Change Data Capture (CDC), data structuring, and standardization, referred to as the small “t,” is handled by data engineers, whose objective is to move data from the source systems into the data preparation or atomic data layer of the warehouse. The processing of that atomic layer with business semantics, including data aggregation and the generation of data metrics (operations such as Group By and Join), is typically performed by business data engineers or data analysts who are proficient in SQL.
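The following sketch illustrates that division of roles. The pipeline applies only lightweight, non-business normalization (the small “t”) before loading, while the business-level joins and aggregations (the capital “T”) remain SQL inside the warehouse, typically owned by analysts. As before, sqlite3 stands in for the warehouse and all names are illustrative assumptions.

```python
# EtLT in miniature: the pipeline does only lightweight, non-business
# normalization (the small "t") before loading; business joins and
# aggregations (the capital "T") stay in the warehouse as SQL owned by
# analysts. sqlite3 stands in for the warehouse; field names are illustrative.
import sqlite3

def small_t(record):
    """Structural cleanup only: trim, standardize types, drop bad rows."""
    order_id, customer, amount = record
    if not order_id.strip():
        return None                          # discard malformed records
    return (order_id.strip(), customer.strip().lower(), float(amount))

raw = [(" 1001 ", "Alice", "129.90"), ("", "ghost", "0"), ("1002", "BOB ", "59")]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE dwd_orders (order_id TEXT, customer TEXT, amount REAL)")

# E + t + L: extract, normalize in flight, load into the atomic layer.
cleaned = [row for row in (small_t(rec) for rec in raw) if row is not None]
warehouse.executemany("INSERT INTO dwd_orders VALUES (?, ?, ?)", cleaned)

# T: business aggregation expressed in SQL, typically written by analysts.
print(warehouse.execute(
    "SELECT customer, SUM(amount) FROM dwd_orders GROUP BY customer"
).fetchall())
```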
With growing data volumes and the spread of EtLT principles, standalone ODS projects have gradually faded from the limelight.
The EtLT architecture, as summarized by James Densmore in the Data Pipelines Pocket Reference (2021), is a modern, globally popular data processing framework that emerged in response to changes in the modern data infrastructure.
The modern data infrastructure has the following characteristics, which led to the emergence of the EtLT architecture:
In today’s global enterprise landscape, the advent of cloud and SaaS has made an already complex data source environment even more intricate. Dealing with SaaS data gave rise to the concept of data ingestion, exemplified by tools such as Fivetran and Airbyte, which aim to solve the ELT (Extract, Load, Transform) problem of ingesting SaaS data into warehouses like Snowflake. The complexity of data sources has grown further as cloud database services (e.g., AWS Aurora, AWS RDS, managed MongoDB services) proliferate and coexist with traditional on-premises databases and software (SAP, Oracle, DB2) in hybrid cloud architectures. Traditional ETL and ELT architectures cannot cope with processing data in such a complex environment.
In modern data architecture, data lakes have emerged that combine the features of the traditional ODS (Operational Data Store) and the data warehouse: they enable real-time data processing and can absorb changes coming from the source (e.g., Apache Hudi, Apache Iceberg, Databricks Delta Lake). At the same time, real-time data warehouses have appeared, with new computing engines (e.g., Apache Pinot, ClickHouse, Apache Doris) making real-time ETL a priority. Traditional CDC ETL tools and real-time stream processing platforms, however, struggle to support data lakes and real-time data warehouses adequately, whether because of compatibility issues with the new storage engines or because they cannot connect to modern data sources, and they lack robust architectural and tooling support.
Also in modern data architecture, a new class of systems has emerged that aims to minimize data movement between stores, running complex queries directly through connectors or via rapid data loading. Examples include Trino (formerly PrestoSQL), backed by Starburst, and Onehouse, built on Apache Hudi. These tools excel at data caching and on-the-fly cross-data-source queries, a Big Data Federation paradigm that traditional ETL/ELT tools struggle to support.
With the emergence of ChatGPT in 2022, AI models became algorithmically feasible for widespread enterprise use, and the bottleneck for deploying AI applications shifted to data supply. Data lakes and Big Data Federation address data storage and querying, but traditional ETL, ELT, and stream processing have become the bottleneck on the supply side: they either cannot quickly integrate the full range of traditional and emerging data sources, or they cannot serve both AI training and online AI applications from a single codebase.
As enterprises become more data-driven, the number of data users within organizations has grown rapidly, from traditional data engineers to data analysts, AI practitioners, sales analysts, and financial analysts, each with different data requirements. After detours through NoSQL and NewSQL, SQL has re-emerged as the de facto standard for complex business analysis, and a large number of analysts and business-unit engineers now use SQL to solve the “last mile” of data analysis within the enterprise. Meanwhile, handling complex unstructured data remains the job of professional data engineers using technologies such as Spark, MapReduce, and Flink. The needs of these two groups diverge significantly, and traditional ETL and ELT architectures can no longer satisfy both.
Against the backdrop mentioned above, data processing has gradually evolved into the EtLT architecture:
EtLT Architecture Overview:
EtLT splits the traditional ETL and ELT structures and combines real-time and batch processing to accommodate real-time data warehouses and AI application requirements.
In the EtLT architecture, different user roles have distinct responsibilities: data engineers own extraction, CDC, and lightweight normalization (the small “t”), while SQL-proficient analysts and business engineers own the business-level transformations (the capital “T”) inside the data warehouse or data lake.
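One way to picture the “single codebase for batch and real-time” idea is the sketch below: the same small-t normalization function feeds both a bulk backfill and an incremental, CDC-style path. The function and type names are illustrative assumptions and do not reflect any particular engine’s API.

```python
# One transformation codebase serving both batch and real-time paths, which is
# the unification EtLT aims for. The function and type names here are
# illustrative and do not reflect any particular engine's API.
from typing import Dict, Iterable, Iterator, List

def small_t(record: Dict) -> Dict:
    """Lightweight normalization shared by both paths (no joins/aggregations)."""
    return {"id": str(record["id"]).strip(), "amount": float(record["amount"])}

def batch_load(records: Iterable[Dict]) -> List[Dict]:
    """Historical backfill: transform a full extract in one pass."""
    return [small_t(r) for r in records]

def stream_load(changes: Iterator[Dict]) -> Iterator[Dict]:
    """Real-time path: transform change events (e.g., from CDC) as they arrive."""
    for change in changes:
        yield small_t(change)

history = [{"id": " 1 ", "amount": "10"}, {"id": "2", "amount": "20.5"}]
cdc_events = iter([{"id": "3", "amount": "5"}])

print(batch_load(history))            # bulk load into the warehouse atomic layer
print(list(stream_load(cdc_events)))  # incremental load via the same logic
```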
There are several open-source implementations of EtLT in modern data architecture. Examples include DBT, which helps analysts and business developers quickly develop data applications based on Snowflake, and Apache DolphinScheduler, a visual workflow orchestration tool for big data tasks. DolphinScheduler plans to introduce a Task IDE, allowing data analysts to directly debug SQL tasks for Hudi, Hive, Presto, ClickHouse, and more, and create workflow tasks through drag-and-drop.
As a representative of the EtLT architecture, Apache SeaTunnel started with support for various cloud and on-premises data sources and gradually expanded to SaaS sources and Reverse ETL, as well as to supplying data for large models, steadily filling out the EtLT landscape. The latest SeaTunnel Zeta computing engine delegates complex operations such as Join and Group By to the target data warehouse and focuses on data normalization and standardization, aiming to unify real-time and batch processing with a single set of code and a high-performance engine. SeaTunnel also supports large models directly, allowing them to interact with the 100+ data sources SeaTunnel supports (see https://seatunnel.apache.org/docs/Connector-v2-release-state/), from traditional and cloud databases to SaaS applications.
Since joining the Apache Incubator in late 2021, Apache SeaTunnel has grown five-fold in a single year and now supports more than 100 data sources (see https://seatunnel.apache.org/docs/2.3.2/category/source-v2), with connector coverage steadily improving across traditional databases, cloud databases, and SaaS offerings.
The SeaTunnel Zeta engine, released in Apache SeaTunnel 2.3.0, brings features such as distributed CDC, schema evolution between source and target tables, and whole-database and multi-table synchronization. Its performance has attracted numerous global users, including Bharti Airtel, the second-largest telecommunications operator in India; Shopee.com, a Singapore-based e-commerce platform; and Vip.com, a major Chinese online retailer.
Notably, SeaTunnel now supports large-model training and vector databases, enabling seamless interaction between large models and SeaTunnel’s 100+ supported data sources (see the article “Breakthrough in the book search field! Use Apache SeaTunnel, Milvus, and OpenAI to improve the accuracy and efficiency of book title similarity search”). SeaTunnel can also leverage ChatGPT to generate SaaS connectors directly, giving large models and data warehouses rapid access to a wide range of information from across the internet.
As AI, cloud, and SaaS continue to grow in complexity, the demands of real-time CDC, SaaS integration, and loading into data lakes and real-time data warehouses have made simple ETL architectures inadequate for modern enterprises. The EtLT architecture, tailored to the different stages of enterprise development, is destined to shine in the modern data infrastructure. With the mission of “Connecting the world’s data sources and synchronizing them as swiftly as flying,” Apache SeaTunnel warrants the attention of every data professional.