What Is CDC?

Change Data Capture (CDC) is a mechanism that tracks row-level changes (inserts, updates, deletes) in a database and notifies downstream systems in the order they occur. In disaster recovery scenarios, CDC is often used for real-time synchronization from a primary database to a standby one.

    source ----------> CDC ----------> sink

Apache SeaTunnel CDC

SeaTunnel CDC supports two types of synchronization:

- Snapshot Reading: Reads historical data from tables
- Incremental Tracking: Captures and reads incremental change logs from tables

Lock-Free Snapshot Sync

Why emphasize “lock-free”? Because some CDC platforms (e.g., Debezium) may lock tables during historical data sync. SeaTunnel avoids this by reading snapshots without locking. Here’s the basic snapshot read process:

    storage------------->splitEnumerator----------split---------->reader
                                ^                                   |
                                |                                   |
                                \-----------------report------------/

- Split Assignment: The splitEnumerator divides table data into multiple chunks (splits) based on a specified field (e.g., an ID column or a unique key) and a step size.
- Parallel Processing: Each split is routed to a different reader for parallel reading. Each reader occupies one connection.
- Event Feedback: After completing a split, the reader reports progress back to the splitEnumerator.

Each split sent to a reader contains metadata:

    String              splitId         // Routing ID
    TableId             tableId         // Table ID
    SeatunnelRowType    splitKeyType    // Field type used for splitting
    Object              splitStart      // Split start point
    Object              splitEnd        // Split end point
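
As a rough illustration of split assignment by step size, the sketch below cuts a numeric key range into chunks. It is not SeaTunnel's actual enumerator: the class `ChunkSplitSketch`, the simplified `Split` type, the method `splitByStep`, and the use of `Long` keys are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitSketch {

    /** Hypothetical, simplified split covering the half-open key range [splitStart, splitEnd). */
    public static final class Split {
        public final String splitId;
        public final Long splitStart; // null means "no lower bound"
        public final Long splitEnd;   // null means "no upper bound"

        Split(String splitId, Long splitStart, Long splitEnd) {
            this.splitId = splitId;
            this.splitStart = splitStart;
            this.splitEnd = splitEnd;
        }
    }

    /**
     * Cuts the key range [minKey, maxKey] into chunks of at most stepSize keys.
     * The first and last chunks are left open-ended so that keys outside the
     * sampled range still fall into some split.
     */
    public static List<Split> splitByStep(String table, long minKey, long maxKey, long stepSize) {
        if (stepSize <= 0) {
            throw new IllegalArgumentException("stepSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        Long start = null;
        long boundary = minKey + stepSize;
        while (boundary <= maxKey) {
            splits.add(new Split(table + ":" + splits.size(), start, boundary));
            start = boundary;
            boundary += stepSize;
        }
        // Final chunk: everything from the last boundary upward.
        splits.add(new Split(table + ":" + splits.size(), start, null));
        return splits;
    }
}
```

For keys 1 to 10 with a step size of 4, this yields three splits: below 5, from 5 up to 9, and from 9 upward.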

The reader generates SQL statements based on this info. Before reading, it records the current log position (the low watermark) for the split. Once the split is complete, it reports:

    String      splitId         // Split ID
    Offset      highWatermark   // Log position after split is processed
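
To make the SQL generation step concrete, here is a minimal sketch of how a reader might turn a split's bounds into a parameterized range query. The class `ChunkQuerySketch` and the method `buildChunkQuery` are hypothetical names, and the half-open range convention is an assumption, not something the article specifies.

```java
public class ChunkQuerySketch {

    /**
     * Builds a parameterized query for one split over the half-open range
     * [splitStart, splitEnd). A null bound means that side is unbounded.
     * Table and column names are assumed to be trusted and already quoted.
     */
    public static String buildChunkQuery(String table, String splitKey,
                                         Object splitStart, Object splitEnd) {
        StringBuilder sql = new StringBuilder("SELECT * FROM ").append(table);
        if (splitStart != null && splitEnd != null) {
            sql.append(" WHERE ").append(splitKey).append(" >= ? AND ")
               .append(splitKey).append(" < ?");
        } else if (splitStart != null) {
            sql.append(" WHERE ").append(splitKey).append(" >= ?");
        } else if (splitEnd != null) {
            sql.append(" WHERE ").append(splitKey).append(" < ?");
        }
        return sql.toString();
    }
}
```

The bound values would then be supplied through a prepared statement, so the reader never concatenates row data into the SQL text.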

Incremental Synchronization

After snapshot reading, any changes in the source DB are continuously captured and synced in real time. Unlike the snapshot phase, this stage reads from the DB’s log (e.g., MySQL binlog) and typically uses a single-threaded reader to reduce pressure on the DB.

    data log------------->splitEnumerator----------split---------->reader
                                ^                                   |
                                |                                   |
                                \-----------------report------------/

During incremental sync, all snapshot splits and tables are merged into a single split. Metadata for the incremental split:

    String                              splitId
    Offset                              startingOffset                  // Earliest log start across all splits
    Offset                              endingOffset                    // Log end position; null if continuous
    List<TableId>                       tableIds
    Map<TableId, Offset>                tableWatermarks                 // Watermarks per table from snapshot phase
    List<CompletedSnapshotSplitInfo>    completedSnapshotSplitInfos     // Snapshot split details

CompletedSnapshotSplitInfo includes:

    String              splitId
    TableId             tableId
    SeatunnelRowType    splitKeyType
    Object              splitStart
    Object              splitEnd
    Offset              watermark       // High watermark from report
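
One way to picture how the incremental split's startingOffset could be derived is to take the smallest high watermark among the completed snapshot splits. In the sketch below, `StartingOffsetSketch`, `CompletedSplit`, and `startingOffset` are hypothetical names, and log offsets are modeled as plain `long` values purely for illustration.

```java
import java.util.List;

public class StartingOffsetSketch {

    /** Hypothetical stand-in for CompletedSnapshotSplitInfo; offsets are plain longs here. */
    public static final class CompletedSplit {
        public final String splitId;
        public final long watermark; // high watermark reported by the reader

        public CompletedSplit(String splitId, long watermark) {
            this.splitId = splitId;
            this.watermark = watermark;
        }
    }

    /** Returns the smallest high watermark, i.e. the earliest log position that must be re-read. */
    public static long startingOffset(List<CompletedSplit> completed) {
        if (completed.isEmpty()) {
            throw new IllegalArgumentException("no completed snapshot splits");
        }
        long min = Long.MAX_VALUE;
        for (CompletedSplit split : completed) {
            min = Math.min(min, split.watermark);
        }
        return min;
    }
}
```

Starting at the minimum is conservative: some events will be read again, and they have to be filtered against the per-split watermarks afterwards.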

The incremental phase finds the smallest watermark among the snapshot splits and starts reading from that log position.

Exactly-Once Guarantee

Whether during snapshot or incremental sync, the database may still be changing. How does SeaTunnel ensure exactly-once processing?

Snapshot Phase

During snapshot reading, imagine that data changes while a split is being read (e.g., insert k3, update k2, delete k1). Without special handling, these changes could be missed. SeaTunnel solves this by:

1. Reading the low watermark from the database log before the split
2. Reading data for split {start, end}
3. Recording the high watermark after the split
4. Comparing the two: if high = low, no change occurred. If high > low, a change occurred during the read, so SeaTunnel:
   - Caches snapshot data in-memory
   - Replays log events between the low and high watermark onto the in-memory table using primary key order
5. Reporting the high watermark

                insert k3      update k2      delete k1
                    |               |               |
                    v               v               v
     bin log --|---------------------------------------------------|-- log offset
          low watermark                                     high watermark

    CDC read data:  k1 k3 k4
                        | replay
                        v
    Final result:    k2 k3' k4
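
The replay step can be sketched as applying log events to a cached copy of the split, keyed by primary key. This is a simplified illustration, not SeaTunnel's implementation: `SnapshotReplaySketch`, `LogEvent`, `Op`, and `replay` are hypothetical names, offsets are plain `long` values, events are assumed to arrive in log order, and treating the window as low < offset <= high is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SnapshotReplaySketch {

    public enum Op { INSERT, UPDATE, DELETE }

    /** Hypothetical log event; offset stands in for a log position. */
    public static final class LogEvent {
        public final long offset;
        public final Op op;
        public final String key;
        public final String value; // null for DELETE

        public LogEvent(long offset, Op op, String key, String value) {
            this.offset = offset;
            this.op = op;
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Applies events with low < offset <= high to the cached snapshot rows.
     * INSERT and UPDATE are treated as upserts; DELETE removes the key.
     * The TreeMap keeps the result ordered by primary key.
     */
    public static TreeMap<String, String> replay(Map<String, String> snapshotRows,
                                                 List<LogEvent> eventsInLogOrder,
                                                 long low, long high) {
        TreeMap<String, String> table = new TreeMap<>(snapshotRows);
        if (high == low) {
            return table; // nothing changed while the split was being read
        }
        for (LogEvent event : eventsInLogOrder) {
            if (event.offset <= low || event.offset > high) {
                continue; // outside this split's watermark window
            }
            if (event.op == Op.DELETE) {
                table.remove(event.key);
            } else {
                table.put(event.key, event.value);
            }
        }
        return table;
    }
}
```

Fed the rows k1, k3, k4 and the three events from the diagram, the sketch ends with k2, k3' and k4, matching the final result shown there.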

Incremental Phase

Before starting incremental sync, SeaTunnel validates the snapshot phase. It checks for inter-split changes: data changes that may have occurred between splits. SeaTunnel handles this by:

1. Finding the minimum watermark from all snapshot splits
2. Starting to read from that log position
3. For each log event, checking whether the data was already processed in a snapshot split
4. If not, treating it as inter-split data and correcting it
5. After all tables are validated, beginning true incremental sync

              |------------filter split2-----------------|
              |----filter split1------|
    data log -|-----------------------|------------------|----------------------------------|- log offset
            min watermark      split1 watermark    split2 watermark                    max watermark
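
The per-event check during validation might look roughly like the following. This is a sketch under stated assumptions: `BackfillFilterSketch`, `CompletedSplit`, and `shouldEmit` are hypothetical names, keys and offsets are plain `long` values, and each split is taken to cover a half-open key range.

```java
import java.util.List;

public class BackfillFilterSketch {

    /** Hypothetical stand-in for CompletedSnapshotSplitInfo with long keys and offsets. */
    public static final class CompletedSplit {
        public final long splitStart; // inclusive
        public final long splitEnd;   // exclusive
        public final long watermark;  // high watermark reported for this split

        public CompletedSplit(long splitStart, long splitEnd, long watermark) {
            this.splitStart = splitStart;
            this.splitEnd = splitEnd;
            this.watermark = watermark;
        }

        boolean contains(long key) {
            return key >= splitStart && key < splitEnd;
        }
    }

    /**
     * Decides whether a log event read from the minimum watermark onward must be emitted.
     * An event at or below its split's high watermark is already reflected in that
     * split's snapshot output; anything later has not been seen yet.
     */
    public static boolean shouldEmit(long key, long eventOffset, List<CompletedSplit> splits) {
        for (CompletedSplit split : splits) {
            if (split.contains(key)) {
                return eventOffset > split.watermark;
            }
        }
        // No split covers this key (not expected when splits tile the key space): emit it.
        return true;
    }
}
```

This mirrors the diagram: events for split1 are filtered up to the split1 watermark, and events for split2 up to the split2 watermark, even though reading starts at the minimum of the two.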

Fault Tolerance and Checkpointing

How does SeaTunnel support pause and resume? It uses the Chandy-Lamport distributed snapshot algorithm.

Assume two processes, p1 and p2: p1 has variables X1 Y1 Z1, p2 has X2 Y2 Z2:

    p1                                  p2
    X1:0                                X2:4
    Y1:0                                Y2:2
    Z1:0                                Z2:3

p1 initiates a global snapshot by recording its local state and sending a marker to p2. Before p2 receives the marker, it sends message M to p1.

    p1                                  p2
    X1:0     -------marker------->      X2:4
    Y1:0     <---------M----------      Y2:2
    Z1:0                                Z2:3

p2 receives the marker and records its own state. p1 then receives message M and, because it has already taken its snapshot, logs M separately as an in-flight message. Final snapshot:

    p1 M                                p2
    X1:0                                X2:4
    Y1:0                                Y2:2
    Z1:0                                Z2:3

In SeaTunnel CDC, markers are sent to all nodes (readers, splitEnumerators, writers), and each stores its current state for recovery.
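
The p1/p2 walkthrough can be replayed with a small model of the marker rules. This is a teaching sketch, not SeaTunnel's checkpoint code: `MarkerSnapshotSketch` and `Node` are hypothetical names, each node is assumed to have a single incoming FIFO channel, and local state is reduced to a string.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkerSnapshotSketch {

    /** One node with a single incoming FIFO channel (a simplifying assumption). */
    public static final class Node {
        public final String name;
        public String state;          // live local state
        public String recordedState;  // state captured for the snapshot, null until taken
        public final List<String> recordedInFlight = new ArrayList<>();
        private boolean recordingChannel; // true between own snapshot and the peer's marker

        public Node(String name, String state) {
            this.name = name;
            this.state = state;
        }

        /** Initiator: record local state, then start recording the incoming channel. */
        public void startSnapshot() {
            recordedState = state;
            recordingChannel = true;
        }

        /** Handles an incoming marker. Returns true if this node must now send its own marker. */
        public boolean onMarker() {
            if (recordedState == null) {
                // First marker seen: record state; the channel it arrived on counts as empty.
                recordedState = state;
                return true;
            }
            // Snapshot already taken: the marker closes the channel recording.
            recordingChannel = false;
            return false;
        }

        /** Logs a message as in-flight if it arrives while the channel is being recorded. */
        public void onMessage(String message) {
            if (recordingChannel) {
                recordedInFlight.add(message);
            }
        }
    }
}
```

Replaying the example means calling `p1.startSnapshot()`, then `p2.onMarker()`, then `p1.onMessage("M")`, and finally `p1.onMarker()` once p2's marker arrives. p1 ends up with its recorded state plus M as an in-flight message, and p2 with its recorded state only.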

Supercharge Your ETL Pipeline with SeaTunnel’s Lock-Free CDC
