William Guo, WhaleOps CEO, Apache Software Foundation Member
Last year witnessed the explosive rise of large models, generating global enthusiasm and making AI seem like a solution to all problems. This year, as the hype subsides, large models have entered a deeper phase, aiming to reshape the foundational logic of various industries. In big data processing, the collision between large models and traditional ETL (Extract, Transform, Load) processes has sparked new debates. Large models are built on “Transformers,” while ETL relies on “Transform” steps, two similar names for vastly different paradigms. Some voices boldly predict: "ETL will be completely replaced in the future, as large models can handle all data!" Does this signal the end of the decades-old ETL framework that underpins data processing? Or is it merely a misguided prediction? Behind this conflict lies a deeper question about technology's future.
With the rapid development of large models, some have begun to speculate whether traditional big data processing methods, including ETL, are still necessary. Large models, capable of autonomously learning rules and discovering patterns from vast datasets, are undeniably impressive. However, my answer is clear: ETL will not disappear. Large models still fail to address several core data challenges:
Despite their outstanding performance in specific tasks, large models incur enormous computational costs. Training a large-scale Transformer model may take weeks and consume vast amounts of energy and financial resources. By contrast, ETL, which relies on predefined rules and logic, is efficient, resource-light, and excels at processing structured data.
For everyday enterprise data tasks, many operations remain rule-driven, such as:
These tasks can be swiftly handled by ETL tools without requiring the complex inference capabilities of large models.
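A rule-driven task of this kind can be sketched in a few lines of plain Python. The sketch below is only an illustration under stated assumptions: the record fields (`id`, `email`, `signup`), the two accepted date formats, and the function names are all hypothetical and not taken from any particular ETL tool.

```python
from datetime import datetime

# Hypothetical raw records; the field names are illustrative only.
RAW_ROWS = [
    {"id": "1", "email": " Alice@Example.com ", "signup": "2024-01-05"},
    {"id": "1", "email": "alice@example.com", "signup": "2024-01-05"},
    {"id": "2", "email": "BOB@example.com", "signup": "05/01/2024"},
]

# Assumed set of input date formats (ISO first, then day/month/year).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(value):
    """Try each known format in order and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def transform(rows):
    """Apply fixed rules: cast ids, drop duplicate ids, normalize email and date."""
    seen = set()
    out = []
    for row in rows:
        key = int(row["id"])
        if key in seen:
            continue  # rule: keep the first record per id
        seen.add(key)
        out.append(
            {
                "id": key,
                "email": row["email"].strip().lower(),
                "signup": normalize_date(row["signup"]),
            }
        )
    return out


if __name__ == "__main__":
    for record in transform(RAW_ROWS):
        print(record)
```

Every rule here is explicit, so the same input always yields the same output, and no model inference is involved at any step.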
Large models have excelled in natural language processing (NLP) but have also exposed an inherent challenge: the ambiguity and vagueness of human language. For example:
By contrast, ETL is deterministic, processing data based on pre-defined rules to produce predictable, standardized outputs. In high-demand sectors like finance and healthcare, ETL's reliability and precision remain critical advantages.
Large models are adept at extracting insights from unstructured data (e.g., text, images, videos), but they often struggle with structured data tasks. For instance:
In scenarios dominated by structured data (e.g., tables, JSON), ETL remains the optimal choice.
Large models are often referred to as “black boxes.” Even when data processing is complete, their internal workings and decision-making mechanisms remain opaque:
ETL, in contrast, provides highly transparent processes, with every data handling step documented and auditable, ensuring compliance with corporate and industry standards.
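One way to picture step-level auditability is a pipeline runner that records what each step did. This is a minimal sketch, not the mechanism of any specific ETL product: `run_pipeline`, the two example steps, and the `amount` field are hypothetical names chosen for illustration.

```python
def run_pipeline(rows, steps):
    """Run named steps in order and record row counts before and after each one."""
    audit = []
    for name, step in steps:
        rows_in = len(rows)
        rows = step(rows)
        audit.append({"step": name, "rows_in": rows_in, "rows_out": len(rows)})
    return rows, audit


def drop_missing_amount(rows):
    """Rule: discard records that have no amount."""
    return [r for r in rows if r.get("amount") is not None]


def to_cents(rows):
    """Rule: add an integer amount in cents alongside the original value."""
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in rows]


if __name__ == "__main__":
    data = [{"amount": 12.5}, {"amount": None}, {"amount": 3.0}]
    result, audit_log = run_pipeline(
        data,
        [("drop_missing_amount", drop_missing_amount), ("to_cents", to_cents)],
    )
    for entry in audit_log:
        print(entry)
```

The audit log states exactly which step removed a record, which is the kind of trace that an opaque model call does not provide on its own.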
Large models are highly sensitive to data quality. Noise, anomalies, or non-standardized inputs can severely affect their performance:
ETL ensures data is cleaned, deduplicated, and standardized before being fed into large models, maintaining high data quality.
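As a rough illustration of this preparation step, the sketch below cleans a small batch of text before it would be handed to a model. The function name `clean_corpus` and the specific rules (collapse whitespace, drop empty entries, remove case-insensitive duplicates) are assumptions made for the example; a production pipeline would apply whatever rules its data requires.

```python
import re


def clean_corpus(texts):
    """Collapse whitespace, drop empty entries, and remove case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for text in texts:
        normalized = re.sub(r"\s+", " ", text).strip()
        if not normalized:
            continue  # nothing left after cleaning
        key = normalized.casefold()
        if key in seen:
            continue  # duplicate of an earlier entry
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


if __name__ == "__main__":
    noisy = ["Hello   world", "hello world ", "", "  \n", "New\tdoc"]
    print(clean_corpus(noisy))
```

The first occurrence of each text is kept with its original casing, and the order of the input is preserved.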
Despite the excellence of large models in many areas, their complexity, reliance on high-quality data, hardware demands, and practical limitations ensure they cannot entirely replace ETL. As a deterministic, efficient, and transparent tool, ETL will continue to coexist with large models, providing dual safeguards for data processing.
While ETL cannot be replaced, the rise of large models in data processing is an inevitable trend. For decades, computing systems were CPU-centric, with other components considered peripherals. GPUs were primarily used for gaming, but today, data processing relies on the synergy of CPUs and GPUs (or NPUs). This paradigm shift reflects broader changes, mirrored in the stock trends of Intel and NVIDIA.
Historically, data processing architectures evolved from "CPU-centric" to "CPU+GPU (and even NPU) collaboration." This transition, driven by changes in computing performance requirements, has deeply influenced the choice of data processing tools.
During the CPU-centric era, early ETL processes heavily relied on CPU logic for operations like data cleaning, formatting, and aggregation. These tasks were well-suited to CPUs’ sequential processing capabilities.
However, the rise of complex data formats (audio, video, text) and exponential storage growth revealed the limitations of CPU power. GPUs, with their unparalleled parallel processing capabilities, have since taken center stage in data-intensive tasks like training large Transformer models.
Traditional ETL processes, optimized for "CPU-centric" computing, excel at handling rule-based, structured data tasks. Examples include:
Large models, in contrast, require GPU power for high-dimensional matrix computations and large-scale parameter optimization:
This reflects a shift from logical computation to neural inference, broadening data processing to include reasoning and knowledge extraction.
The rise of large models highlights inefficiencies in traditional data processing, necessitating a more advanced, unified architecture.
Future ETL tools will embed AI capabilities, merging traditional strengths with modern intelligence:
With the continuous advancement of technology, large models and traditional ETL processes are gradually converging. The next generation of ETL architectures is expected to blend the intelligence of large models with the efficiency of ETL, creating a comprehensive framework capable of processing diverse data types.
The foundation of data processing is shifting from CPU-centric systems to a collaborative approach involving CPUs and GPUs:
This trend is reflected not only in technical innovation but also in industry dynamics: Intel is advancing AI accelerators for CPU-AI collaboration, while NVIDIA is expanding GPU applications into traditional ETL scenarios. The synergy between CPUs and GPUs promises higher efficiency and intelligent support for next-generation data processing.
As ETL and large model functionalities become increasingly intertwined, data processing is evolving into a multifunctional, collaborative platform where ETL serves as a data preparation tool for large models.
Large models require high-quality input data during training, and ETL provides the preliminary processing to create ideal conditions:
The future of ETL tools lies in embedding AI capabilities to achieve smarter data processing:
AI-enhanced ETL represents a transformative leap from traditional ETL, offering embedding generation, LLM-based knowledge extraction, unstructured data processing, and dynamic rule generation to significantly improve efficiency, flexibility, and intelligence in data processing.
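To make the idea of embedding generation inside a transform step concrete, here is a minimal sketch. It assumes a pipeline where the embedding function is pluggable; `toy_embed` is a deliberately simple stand-in (bucketed character counts), not a real embedding model, and `embed_transform` is a hypothetical name rather than the API of any existing tool.

```python
from typing import Callable, Iterable, Iterator

Vector = list[float]


def toy_embed(text: str, dim: int = 4) -> Vector:
    """Stand-in for a real model: bucketed character counts, L2-normalized."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec


def embed_transform(
    rows: Iterable[dict],
    text_field: str,
    embed: Callable[[str], Vector] = toy_embed,
) -> Iterator[dict]:
    """Transform step: attach an 'embedding' column computed from one text field."""
    for row in rows:
        yield {**row, "embedding": embed(row[text_field])}


if __name__ == "__main__":
    docs = [{"id": 1, "body": "abc"}, {"id": 2, "body": "ETL meets large models"}]
    for enriched in embed_transform(docs, "body"):
        print(enriched["id"], enriched["embedding"])
```

Because the embedding function is passed in as a parameter, the same transform step could call a hosted or local model instead of the toy function without changing the rest of the pipeline.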
As an example, the open-source Apache SeaTunnel project is breaking traditional ETL limitations by supporting innovative data formats and advanced processing capabilities, showcasing the future of data processing:
Tools like SeaTunnel illustrate how modern data processing has evolved into an AI+Big Data full-stack collaboration system, becoming central to enterprise AI and data strategies.
Large model transformers and big data transforms are not competitors but allies. The future of data processing lies in the deep integration of ETL and large models, as illustrated below:
The convergence of large models and ETL will propel data processing into a new era of intelligence, standardization, and openness. By addressing enterprise demands, this evolution will drive business innovation and intelligent decision-making, becoming a core engine for the future of data-driven enterprises.
William Guo is a recognized leader in the big data and open-source communities. He is a Member of the Apache Software Foundation, serving on the PMC for Apache DolphinScheduler and as a mentor for Apache SeaTunnel. He has also been a Track Chair for Workflow/Data Governance at ApacheCon Asia (2021/2022/2023) and a speaker at ApacheCon North America.
With over 20 years of experience in big data technology and data management, William has held senior leadership roles, including Chief Technology Officer at Analysys, Senior Big Data Director at Lenovo, and Big Data Director/Manager at Teradata, IBM, and CICC. His extensive experience is focused on data warehousing, ETL processes, and data governance. He has developed and managed large-scale data systems, guiding enterprises through complex data integration and governance challenges, and demonstrating a strong track record in leading open-source initiatives and shaping enterprise-level big data strategies.