Dmall is a global provider of intelligent retail solutions, supporting the digital transformation of over 430 clients. With rapid business expansion, real-time data synchronization, resource efficiency, and development flexibility have become the three key challenges we must overcome.

## Four Stages of Dmall's Data Platform Evolution

Dmall's data platform has gone through four major transformations, always aiming to be "faster, more efficient, and more stable." We initially used AWS-EMR to quickly establish cloud-based big data capabilities, then moved back to IDC self-built Hadoop clusters, combining open-source cores with self-developed integration, scheduling, and development components and turning heavy assets into reusable light services. As the business demanded lower costs and higher elasticity, the team rebuilt the foundation around storage-compute separation and containerization, introducing Apache SeaTunnel for real-time data lake integration. Finally, with Apache Iceberg and Paimon as unified storage formats, we formed a lakehouse architecture that provides a stable, low-cost data foundation for AI, completing the transition from adopting the cloud to building on it, and from offline to real-time.

The four stages in short: AWS-EMR → IDC self-built Hadoop clusters → Apache SeaTunnel → Apache Iceberg and Paimon.

## Storage-Compute Separation Architecture

Dmall UniData (our Data IDE) uses a storage-compute separation architecture with Kubernetes as the elastic foundation, scaling Spark, Flink, and StarRocks on demand. Iceberg + JuiceFS unifies lake storage, Hive Metastore manages cross-cloud metadata, and Ranger provides fine-grained access control. The architecture is vendor-neutral and fully controllable across the entire tech stack. The business benefits are clear: TCO reduced by 40-75%, resources scale in seconds, and one IDE framework covers integration, scheduling, modeling, querying, and data services, enabling fast delivery with fewer resources and consistent security across clouds.

## I. Pain Points of the Old Architecture

Before introducing Apache SeaTunnel, Dmall's data platform offered self-service synchronization for more than a dozen storage systems such as MySQL, Hive, and ES, using self-developed Spark jobs customized for each data source, but it only supported batch processing.

On the import side, the platform unified ODS data into the data lake with Apache Iceberg as the lakehouse format, making data available downstream on an hourly basis and ensuring high data reuse and quality. The self-developed Spark synchronization tools were stable, but suffered from slow startup, high resource usage, and poor extensibility.

> "It's not that Spark is bad, but it's too heavy."

Against the backdrop of cost reduction and efficiency improvement, we re-evaluated the original data integration architecture. Spark's batch jobs were mature, but overkill for small and medium-sized synchronization tasks. Slow startup, high resource consumption, and long development cycles became bottlenecks for the team's efficiency. More importantly, with growing demand for real-time data, Spark's batch processing model was becoming unsustainable.
| Dimension | Old Spark Solution | Business Impact |
| --- | --- | --- |
| High resource usage | 2C8G to start, ~40 s of idle startup | Unfriendly to small and medium-scale synchronization tasks |
| High development cost | No abstracted Source/Sink; every pipeline was full-stack development | Higher development and maintenance costs, lower delivery efficiency |
| No real-time sync | Growing demand for real-time incremental synchronization | Still relied on developers hand-writing Java/Flink jobs |
| Limited data sources | More private cloud deployments and increasingly diverse data sources | Hard to develop new data sources fast enough to meet business needs |

That was until we encountered Apache SeaTunnel, and everything started to change.

## II. Why SeaTunnel?

> "We're not choosing a tool; we're choosing the foundation for the next five years of data integration."

Facing diverse data sources, real-time requirements, and resource optimization pressure, we needed a batch-stream unified, lightweight, efficient, and easily extensible integration platform.
SeaTunnel, with its open-source nature, multi-engine support, rich connectors, and active community, became our final choice. It not only solved Spark's "heavy" problem but also laid the foundation for future lakehouse integration and real-time analytics.

- **Engine neutrality**: built-in Zeta engine, compatible with Spark/Flink, switching automatically based on data volume.
- **200+ connectors**: plugin-based; new data sources require only JSON configuration, no Java code.
- **Unified batch and stream**: one configuration supports full, incremental, and CDC synchronization.
- **Active community**: 8.8k GitHub stars, 30+ PRs merged per week; 5 patches we contributed were merged within 7 days.

## III. New Platform Architecture: Making SeaTunnel "Enterprise-Grade"

> "Open-source doesn't just mean using it as-is, but standing on the shoulders of giants to continue building."

While SeaTunnel is powerful, applying it in enterprise scenarios required an "outer shell": unified management, scheduling, permissions, rate limiting, monitoring, and more. We built a visual, configurable, and extensible data integration platform around SeaTunnel, turning it from an open-source tool into the core engine of Dmall's data platform.

### 3.1 Global Architecture

With Apache SeaTunnel as the foundation, the platform exposes a unified REST API that external systems such as the Web UI, Merchant Exchange, and MCP services can call with one click; a built-in connector template center publishes new storage types in minutes just by filling in parameters, with no coding required. The scheduling layer supports mainstream orchestrators such as Apache DolphinScheduler and Airflow. The engine layer intelligently routes jobs to Zeta, Flink, or Spark based on data volume, running small jobs as lightweight fast tasks and large jobs with distributed parallelism (a simplified routing sketch follows the feature list below). The runtime is fully cloud-native, supporting Kubernetes, YARN, and Standalone modes, which makes private-cloud delivery easy and keeps the platform "template-as-a-service, engine-switchable, deployment-unbound."

### 3.2 Data Integration Features

- **Data Source Registration**: address, account, and password are entered once, with sensitive fields encrypted; public data sources such as Hive are visible to all tenants.
- **Connector Templates**: connectors are added by configuration, with rules that define how the SeaTunnel config is generated and which Source and Sink options the task interface displays.
- **Offline Tasks**: batch tasks on the Zeta and Spark engines, described as DAG diagrams, with wildcard variable injection.
- **Real-Time Tasks**: stream tasks on the Zeta and Flink engines, storing checkpoints via the S3 protocol for CDC incremental synchronization.
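The engine routing mentioned in 3.1 can be illustrated with a minimal sketch. This is not Dmall's or SeaTunnel's actual code: the threshold, enum, and method names are assumptions, and they only show the idea of keeping small batch jobs on the lightweight Zeta engine while sending large batch jobs to Spark and streams to Zeta or Flink.

```java
// Hypothetical sketch of the engine-routing idea from section 3.1.
// The 50 GB threshold and all names are illustrative assumptions.
public final class EngineRouter {

    public enum Engine { ZETA, SPARK, FLINK }

    // Jobs below this estimated input size stay on the lightweight Zeta engine.
    private static final long LARGE_JOB_THRESHOLD_BYTES = 50L * 1024 * 1024 * 1024;

    public Engine route(boolean streaming, long estimatedInputBytes) {
        if (streaming) {
            // Real-time tasks run on Zeta or Flink (see 3.2); large streams go to Flink.
            return estimatedInputBytes > LARGE_JOB_THRESHOLD_BYTES ? Engine.FLINK : Engine.ZETA;
        }
        // Batch tasks: small and medium jobs start fast on Zeta, big jobs go to Spark.
        return estimatedInputBytes > LARGE_JOB_THRESHOLD_BYTES ? Engine.SPARK : Engine.ZETA;
    }

    public static void main(String[] args) {
        EngineRouter router = new EngineRouter();
        System.out.println(router.route(false, 2L << 30));    // small batch  -> ZETA
        System.out.println(router.route(false, 200L << 30));  // large batch  -> SPARK
        System.out.println(router.route(true, 1L << 20));     // small stream -> ZETA
    }
}
```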
### 3.3 Integration Features

- **Access Application**: users submit requests to synchronize tables, which go through approval to safeguard data quality.
- **Database Table Management**: synchronization is managed per database, avoiding a sprawl of sync paths; unified management guarantees data quality and supports merging sharded tables.
- **Base Pulling**: tables are created and initialized automatically by batch tasks; large tables are split as needed and data gaps are backfilled based on conditions.
- **Data Synchronization**: synchronization tasks are submitted to the cluster via the REST API; rate limiting and tagging protect important syncs; CDC writes incrementally into multiple lakehouses.

## IV. Secondary Development: Let SeaTunnel Speak "Dmall Dialect"

> "No matter how excellent an open-source project is, it still can't understand your business 'dialect'."

SeaTunnel's plugin mechanism is flexible, but meeting Dmall-specific requirements such as DDH message formats, sharded-table merging, and dynamic partitioning still required us to modify the code. Fortunately, SeaTunnel's modular design makes secondary development efficient and controllable. Below are the key modules we modified, each addressing a concrete business pain point.

### 4.1 Custom DDH-Format CDC

Dmall's in-house DDH service collects MySQL binlogs and pushes them to Kafka as Protobuf messages. We implemented a custom `KafkaDeserializationSchema` that:

- parses Protobuf into `SeaTunnelRow`;
- turns DDL messages directly into `CatalogTable` objects and automatically adds the new columns in Paimon;
- marks DML rows as "before/after," enabling partial-column updates in downstream StarRocks.
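As a rough illustration of the DML path only, here is a minimal sketch. It does not use Dmall's real DDH Protobuf schema or the full deserializer interface; the `DdhRecord` accessors are hypothetical, DDL handling is omitted, and the `SeaTunnelRow`/`RowKind` usage assumes the API of recent SeaTunnel 2.3.x releases.

```java
import java.util.List;

import org.apache.seatunnel.api.table.type.RowKind;
import org.apache.seatunnel.api.table.type.SeaTunnelRow;

// Sketch: map a (hypothetical) DDH binlog record to a SeaTunnelRow with a row kind,
// so downstream sinks such as StarRocks can apply partial-column updates.
public class DdhDmlMapper {

    public SeaTunnelRow toRow(DdhRecord record) {
        // DELETE uses the before-image; INSERT/UPDATE use the after-image.
        List<Object> values = "DELETE".equals(record.getOp())
                ? record.getBeforeValues()
                : record.getAfterValues();

        SeaTunnelRow row = new SeaTunnelRow(values.toArray());
        switch (record.getOp()) {
            case "INSERT": row.setRowKind(RowKind.INSERT); break;
            case "UPDATE": row.setRowKind(RowKind.UPDATE_AFTER); break; // the before-image would be emitted as UPDATE_BEFORE
            case "DELETE": row.setRowKind(RowKind.DELETE); break;
            default: throw new IllegalArgumentException("Unknown op: " + record.getOp());
        }
        return row;
    }

    /** Hypothetical stand-in for the Protobuf message generated from Dmall's DDH schema. */
    public interface DdhRecord {
        String getOp();
        List<Object> getBeforeValues();
        List<Object> getAfterValues();
    }
}
```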
### 4.2 Router Transform: Multi-Table Merging and Dynamic Partitioning

**Scenario**: merge the 1,200 sharded tables `t_order_00` … `t_order_1199` into a single Paimon table `dwd_order`.

**Implementation** (a simplified sketch follows section 4.4):

- use the regular expression `t_order_(\d+)` to map shard tables to the target table;
- pick the benchmark schema (the shard with the most fields) and fill missing fields in the other shards with `NULL`;
- generate a new unique key from `$table_name + $pk` to avoid primary key conflicts;
- extract the partition field `dt` from the string column `create_time`, automatically recognizing both `yyyy-MM-dd` and `yyyyMMdd` formats.

### 4.3 Hive Sink Support for Overwrite

The community version only supports `append`. Based on PR #7843, we modified SeaTunnel to support an overwrite mode (also sketched after section 4.4):

- before submitting the task, call `FileSystem.listStatus()` to collect the old paths for the affected partition values;
- after the new data is written, atomically delete the old paths, making re-runs idempotent.

This improvement has been contributed back to the community and is expected to ship in version 2.3.14.

### 4.4 Other Patches

- **JuiceFS connector**: supports mount-point caching, improving listing performance by 5x.
- **Independent Kafka 2.x module**: resolves protocol conflicts between versions 0.10 and 2.x.
- **JDK 11 upgrade**: reduces garbage collection time in the Zeta engine by 40%.
- **New JSON UDFs**: added `json_extract_array`/`json_merge` and the date UDF `date_shift()`, all merged into the main branch.
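To make the Router Transform mapping in 4.2 concrete, here is a minimal pure-JDK sketch of the routing, unique-key, and partition-field rules. It is not the actual transform plugin: the class and method names are illustrative, and benchmark-schema alignment is left out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the 4.2 routing rules; not the actual SeaTunnel transform.
public final class OrderShardRouter {

    // t_order_00 ... t_order_1199 all map to one merged target table.
    private static final Pattern SHARD = Pattern.compile("t_order_(\\d+)");
    private static final DateTimeFormatter DASHED = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    private static final DateTimeFormatter COMPACT = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Maps a shard table name to the merged Paimon table if it matches the pattern. */
    public Optional<String> targetTable(String sourceTable) {
        Matcher m = SHARD.matcher(sourceTable);
        return m.matches() ? Optional.of("dwd_order") : Optional.empty();
    }

    /** Builds the new unique key from the shard name plus the original primary key. */
    public String uniqueKey(String sourceTable, String pk) {
        return sourceTable + "_" + pk;
    }

    /** Extracts the dt partition value from create_time, accepting both date formats. */
    public String partitionDt(String createTime) {
        try {
            String datePart = createTime.length() >= 10 ? createTime.substring(0, 10) : createTime;
            return LocalDate.parse(datePart, DASHED).format(DASHED);
        } catch (DateTimeParseException e) {
            return LocalDate.parse(createTime.substring(0, 8), COMPACT).format(DASHED);
        }
    }

    public static void main(String[] args) {
        OrderShardRouter router = new OrderShardRouter();
        System.out.println(router.targetTable("t_order_0815"));        // Optional[dwd_order]
        System.out.println(router.uniqueKey("t_order_0815", "10086")); // t_order_0815_10086
        System.out.println(router.partitionDt("2024-05-01 12:30:00")); // 2024-05-01
        System.out.println(router.partitionDt("20240501123000"));      // 2024-05-01
    }
}
```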
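And the overwrite flow from 4.3, reduced to standard Hadoop `FileSystem` calls. This is a simplified illustration of the "snapshot old partition files, write new data, then delete the old files" idea, not the actual change from PR #7843; the partition path in `main` is only an example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Simplified illustration of the Hive overwrite idea from section 4.3.
public class HivePartitionOverwrite {

    /** Records the files that already exist under a partition before the job writes new data. */
    public static List<Path> snapshotOldFiles(FileSystem fs, Path partitionDir) throws IOException {
        List<Path> oldFiles = new ArrayList<>();
        if (fs.exists(partitionDir)) {
            for (FileStatus status : fs.listStatus(partitionDir)) {
                oldFiles.add(status.getPath());
            }
        }
        return oldFiles;
    }

    /** After the new files are committed, delete the old ones so re-runs stay idempotent. */
    public static void deleteOldFiles(FileSystem fs, List<Path> oldFiles) throws IOException {
        for (Path path : oldFiles) {
            fs.delete(path, true);
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Example partition path; in the real sink this is derived from the task's partition values.
        Path partition = new Path("hdfs:///warehouse/ods.db/t_order/dt=2024-05-01");
        FileSystem fs = partition.getFileSystem(conf);

        List<Path> oldFiles = snapshotOldFiles(fs, partition);
        // ... the sink writes the new data files into the partition here ...
        deleteOldFiles(fs, oldFiles);
    }
}
```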
## V. Pitfalls: Our Real-World Challenges

> "Every pitfall is a necessary step toward stability."

No matter how mature an open-source project is, pitfalls are inevitable when deploying it in real business scenarios. During our use of SeaTunnel we ran into version conflicts, asynchronous DDL, and consumption delays. Below are some typical "pits" we encountered and the solutions that got us out of them.

| Problem | Phenomenon | Root Cause | Solution |
| --- | --- | --- | --- |
| S3 access failure | Spark 3.3.4 conflicts with SeaTunnel's default Hadoop 3.1.4 | Two versions of `aws-sdk` on the classpath | Exclude Spark's `hadoop-client` and use SeaTunnel's uber jar |
| StarRocks ALTER blocked | Writes fail with "column not found" | ALTER in StarRocks is asynchronous; clients keep writing and fail | Poll `SHOW ALTER TABLE STATE` in the sink and resume writing once the status is `FINISHED` |
| Slow Kafka consumption | Only 3k messages per second | The polling thread sleeps 100 ms on empty polls | Contributed PR #7821, adding a "no sleep on empty polling" mode and raising throughput to 120k/s |
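As an illustration of the StarRocks workaround above, here is a minimal JDBC sketch that blocks writes until the schema-change job finishes. The connection details are placeholders, and `SHOW ALTER TABLE COLUMN` is the statement we assume corresponds to the "SHOW ALTER TABLE STATE" polling described in the table; adapt it to your StarRocks version.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: after issuing an ALTER on StarRocks, hold back writes until the
// asynchronous schema-change job reports FINISHED, then resume the sink.
public class StarRocksAlterWaiter {

    public static void waitForAlterFinished(Connection conn, String table, long timeoutMs)
            throws SQLException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        String sql = "SHOW ALTER TABLE COLUMN WHERE TableName = '" + table
                + "' ORDER BY CreateTime DESC LIMIT 1";
        while (System.currentTimeMillis() < deadline) {
            try (Statement stmt = conn.createStatement(); ResultSet rs = stmt.executeQuery(sql)) {
                // No pending job, or the latest job is FINISHED: safe to resume writing.
                if (!rs.next() || "FINISHED".equalsIgnoreCase(rs.getString("State"))) {
                    return;
                }
            }
            Thread.sleep(1000); // poll once per second
        }
        throw new SQLException("ALTER on " + table + " did not finish within " + timeoutMs + " ms");
    }

    public static void main(String[] args) throws Exception {
        // Placeholder FE address and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://starrocks-fe:9030/demo", "user", "***")) {
            waitForAlterFinished(conn, "dwd_order", 10 * 60 * 1000L);
        }
    }
}
```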
## VI. Summary of Benefits: Delivering in Three Months

> "Technical value must ultimately be demonstrated with numbers."

In less than three months of using Apache SeaTunnel, we completed the migration of three merchant production environments. Not only does it "run faster," it also "runs cheaper." With support for Oracle, cloud storage, Paimon, and StarRocks, all source-side needs are covered, and real-time synchronization no longer depends on hand-written Flink code. Template-based, "zero-code" connector integration cut development time from several weeks to about 3 days. Resource consumption dropped to one third of the original Spark solution, so the same data volume runs lighter and faster. With the new UI and on-demand data source permissions, merchant IT teams can configure tasks and monitor data flows themselves, reducing delivery costs and improving the user experience, fulfilling the three key goals of cost reduction, flexibility, and stability.

## VII. Next Steps: Lakehouse + AI Dual-Drive

> "Data integration is not the end, but the beginning of intelligent analysis."

Apache SeaTunnel solved the problem of moving data fast and cheaply; next we need to move data accurately and intelligently. As technologies like Paimon, StarRocks, and LLMs mature, we are building a "real-time lakehouse + AI-driven" data platform, so that data is not only visible but also precisely usable. Dmall will write "real-time" and "intelligent" into the next line of code for its data platform:

- **Lakehouse upgrade**: fully integrate Paimon + StarRocks, reducing ODS data lake latency from hours to minutes and giving merchants near-real-time data.
- **AI ready**: use MCP services to call LLMs to auto-generate synchronization configurations, and introduce vectorized execution engines to build pipelines directly consumable by AI training, enabling "zero-code, intelligent" data integration.
- **Community interaction**: track SeaTunnel's major releases, adopt its performance optimizations, and contribute internal improvements back as PRs, forming a "use-improve-open source" closed loop that keeps amplifying the technical dividend.

## VIII. A Message to My Peers

> "If you're also struggling with 'heavy' and 'slow' data synchronization, give SeaTunnel a sprint's worth of time."

In just 3 months, we cut data integration costs to one third, improved latency from hourly to minute-level, and compressed development cycles from weeks to days. SeaTunnel is not a silver bullet, but it is light, fast, and open enough. As long as you're willing to get hands-on, it can become the new engine of your data platform.