This article provides a detailed walkthrough of how to achieve full data synchronization from MySQL to PostgreSQL using Apache SeaTunnel 2.3.9. We cover the complete end-to-end process — from environment setup to production validation. Let’s dive into the MySQL-to-PostgreSQL synchronization scenario.
Version Requirements:
Run the following SQL command to check the version:
-- Check version information
select version();
-- View replication-related variables
show variables where variable_name in ('log_bin', 'binlog_format', 'binlog_row_image', 'gtid_mode', 'enforce_gtid_consistency');
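For CDC, this check should typically report log_bin = ON, binlog_format = ROW, and binlog_row_image = FULL. If the binlog format is not row-based, it can be switched at runtime; for a permanent change, also set it in my.cnf:
-- Switch to row-based logging if the check above shows STATEMENT or MIXED (takes effect for new sessions)
SET GLOBAL binlog_format = 'ROW';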
For MySQL CDC data synchronization, SeaTunnel needs to read the MySQL binlog and act as a slave node in the MySQL cluster.
Note: In MySQL 8.0+, the binlog is enabled by default, but GTID-based replication mode must be enabled manually.
-- Enable master-slave replication (execute in sequence)
-- SET GLOBAL gtid_mode=OFF;
-- SET GLOBAL enforce_gtid_consistency=OFF;
SET GLOBAL gtid_mode=OFF_PERMISSIVE;
SET GLOBAL gtid_mode=ON_PERMISSIVE;
SET GLOBAL enforce_gtid_consistency=ON;
SET GLOBAL gtid_mode=ON;
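After the last statement succeeds, a quick sanity check confirms the final state:
-- Confirm GTID replication is now fully enabled
SELECT @@gtid_mode, @@enforce_gtid_consistency;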
The user must have REPLICATION SLAVE and REPLICATION CLIENT privileges:
-- Grant privileges to the user
CREATE USER 'test'@'%' IDENTIFIED BY 'password';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'test';
FLUSH PRIVILEGES;
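Before configuring the connector, you can verify that the privileges were applied:
-- Verify the replication privileges of the new user
SHOW GRANTS FOR 'test'@'%';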
By default, SeaTunnel logs output to a single file. For production, it’s preferable to have separate log files per job. Update the logging configuration in log4j2.properties:
############################ log output to file #############################
# rootLogger.appenderRef.file.ref = fileAppender
# Change log output to use independent log files for each job
rootLogger.appenderRef.file.ref = routingAppender
############################ log output to file #############################
For production clusters, it is recommended to install SeaTunnel under the /opt directory and point the SEATUNNEL_HOME environment variable accordingly.
If multiple versions exist, create a symbolic link to align with the server deployment directory:
# Create a symlink
ln -s /opt/apache-seatunnel-2.3.9 /opt/seatunnel
# Set environment variable
export SEATUNNEL_HOME=/opt/seatunnel
For Linux servers, add the following lines to /etc/profile.d/seatunnel.sh:
echo 'export SEATUNNEL_HOME=/opt/seatunnel' >> /etc/profile.d/seatunnel.sh
echo 'export PATH=$SEATUNNEL_HOME/bin:$PATH' >> /etc/profile.d/seatunnel.sh
source /etc/profile.d/seatunnel.sh
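A quick check in a new shell confirms the variables are in effect:
# Verify SEATUNNEL_HOME and that the SeaTunnel scripts are on the PATH
echo $SEATUNNEL_HOME
ls $SEATUNNEL_HOME/bin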
Note: The configuration below does not cover all options but illustrates common production settings.
env {
job.mode = "STREAMING"
job.name = "DEMO"
parallelism = 3
checkpoint.interval = 30000 # 30 seconds
checkpoint.timeout = 30000 # 30 seconds
job.retry.times = 3
job.retry.interval.seconds = 3 # 3 seconds
}
The first step is setting up the env module. Because this job runs as a continuous synchronization task, it’s essential to set the job mode to STREAMING.
Configuring a task name is crucial for identifying and managing jobs in a production environment. Naming conventions based on database or table names can help with monitoring and administration.
Here, we set the parallelism to 3, but this value can be adjusted based on the cluster size and database performance.
source {
MySQL-CDC {
base-url = "jdbc:mysql://192.168.8.101:3306/test?serverTimezone=Asia/Shanghai"
username = "test"
password = "123456"
database-names = ["test"]
# table-names = ["test.test_001","test.test_002"]
table-pattern = "test\\.test_.*" # The first dot is a literal character, requiring escaping; the second dot represents any single character.
table-names-config = [
{"table":"test.test_002","primaryKeys":["id"]}
]
startup.mode = "initial" # First sync all historical data, then incremental updates
snapshot.split.size = "8096"
snapshot.fetch.size = "1024"
server-id = "6500-8500"
connect.timeout.ms = 30000
connect.max-retries = 3
connection.pool.size = 20
exactly_once = false # In analytical scenarios, disabling exactly-once consistency allows some duplicates/losses for better performance.
schema-changes.enabled = true # Enable schema evolution to avoid frequent modifications; supports add, rename, drop operations.
}
}
Time Zone Configuration: It’s recommended to specify the MySQL connection timezone to prevent discrepancies when extracting datetime or timestamp data.
User Credentials: Use the replication user created earlier (test in this example).
Database and Table Selection: Typically, each database is assigned to a separate task. Here, we specify only the test database. Two methods can be used to select tables: an explicit list via table-names, or a regular expression via table-pattern.
Important: When using regular expressions, both the database name and the table name must be included, and the dot that separates them must be escaped (\\.). For example, to match tables prefixed with test_, we use the pattern test\\.test_.*
The first dot (\\.) is a literal separator and therefore requires escaping, while the second dot (inside .*) is a regex metacharacter that matches any single character. For instance, test.test_001 and test.test_002 match this pattern, while test.other_001 does not.
Additionally, for tables without primary keys, logical primary keys can be specified manually to facilitate data synchronization.
The default startup mode is “initial”, which means the connector first synchronizes all existing historical data with a full snapshot and then automatically switches to reading incremental changes from the binlog.
Sharding & Fetching: snapshot.split.size controls how many rows make up each chunk of the initial snapshot, while snapshot.fetch.size controls how many rows are pulled per fetch during the snapshot phase; larger values speed up the full sync at the cost of more memory.
Server ID: server-id identifies SeaTunnel as a replica in the MySQL cluster and must not clash with any other replica. Specifying a range (here 6500-8500) lets each parallel reader claim its own ID.
For CDC-based data synchronization, exactly-once consistency is often not required in analytical scenarios.
It’s highly recommended to enable schema evolution, which:
Allows automatic table modifications (e.g., adding/removing fields)
Reduces the need for manual job updates when the schema changes
However, downstream tasks may fail if they rely on a field that was modified.
Supported schema changes:
✔️ ADD COLUMN
✔️ DROP COLUMN
✔️ RENAME COLUMN
✔️ MODIFY COLUMN
Note: Schema evolution does not support CREATE TABLE or DROP TABLE.
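As an illustration, a source-side ALTER such as the one below would be propagated to the PostgreSQL sink when schema-changes.enabled = true (the column name remark is made up for this example):
-- Example DDL on MySQL that schema evolution can propagate to the sink
ALTER TABLE test.test_001 ADD COLUMN remark VARCHAR(255);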
The sink configuration inserts data into PostgreSQL.
sink {
jdbc {
url = "jdbc:postgresql://192.168.8.101:5432/test"
driver = "org.postgresql.Driver"
user = "postgres"
password = "123456"
generate_sink_sql = true
database = "test"
table = "${database_name}.${table_name}"
schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
data_save_mode = "APPEND_DATA"
# enable_upsert = false
}
}
Key Considerations:
JDBC Connection: url, driver, user, and password point at the target PostgreSQL instance (the PostgreSQL JDBC driver JAR must be available to SeaTunnel).
Auto SQL Generation: generate_sink_sql lets SeaTunnel automatically create tables and generate INSERT, DELETE, and UPDATE statements.
Schema Handling: schema_save_mode and data_save_mode control how the target schema and any existing data are treated (explained below).
User Permissions: The PostgreSQL user must have table creation permissions if using auto-schema generation.
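The example sink uses the postgres superuser, which already has these rights. If you prefer a dedicated sink account, a minimal (hypothetical) setup could look like the following; adapt it to your own security policy:
-- Hypothetical dedicated sink user with the ability to create objects in the target database
CREATE USER seatunnel_sink WITH PASSWORD '123456';
GRANT CONNECT, CREATE ON DATABASE test TO seatunnel_sink;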
For more details, refer to the official documentation: 🔗 SeaTunnel MySQL-CDC Connector Docs
Apache SeaTunnel supports placeholders, which dynamically adjust table names based on the source data.
For example:
table = "${database_name}.${table_name}"
The schema_save_mode parameter plays a crucial role in database-wide synchronization. With CREATE_SCHEMA_WHEN_NOT_EXIST, SeaTunnel automatically creates missing tables in the target database, eliminating the need for manual table creation steps.
Another key configuration is data_save_mode = "APPEND_DATA", which is particularly useful when the target database already contains previously synchronized data. This setting prevents the accidental deletion of existing records, making it a safer choice for most scenarios. However, if your use case requires a different approach, you can modify this setting according to the official documentation guidelines.
Another important parameter is enable_upsert. If you can guarantee that the source data contains no duplicate records, disabling upsert (enable_upsert = false) can significantly enhance synchronization performance. This is because, without upsert, the system does not need to check for existing records before inserting new ones.
However, if there is a possibility of duplicate records in the source data, it is strongly recommended to keep upsert enabled (enable_upsert = true). This ensures that records are inserted or updated based on their primary key, preventing duplication issues.
For detailed parameter explanations and further customization options, please refer to the official Apache SeaTunnel documentation.
Once your configuration file is ready, submit the job using the SeaTunnel command-line tool:
./bin/seatunnel.sh --config /path/to/config.conf --async
Key Parameters:
--config: Specifies the path to your configuration file.
--async: Submits the job asynchronously, allowing the command line to exit while the job continues in the background.
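Assuming the default Zeta engine client, a couple of other subcommands are useful for day-to-day operation; the exact flags can differ between releases, so confirm them with ./bin/seatunnel.sh --help:
# List running jobs and cancel one by its job ID (flag names assumed from the Zeta client; verify locally)
./bin/seatunnel.sh -l
./bin/seatunnel.sh --cancel-job <jobId>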
After submission, you can monitor the job via SeaTunnel’s cluster UI. In version 2.3.9, SeaTunnel provides a visual interface where you can view job logs, execution status, and data throughput details.
For this demonstration, we created two tables (test_001 and test_002) and inserted sample data into MySQL. Using SeaTunnel's synchronization tasks, the data was successfully synchronized to PostgreSQL. The demonstration included insertions, deletions, updates, and even table schema modifications, all of which were reflected in real time on PostgreSQL.
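The article does not include the demo DDL; a minimal sketch consistent with the source configuration above might look like this (test_002 intentionally has no physical primary key, which is why a logical key is declared via table-names-config):
-- Hypothetical demo tables; the actual structures used in the demonstration are not shown
CREATE TABLE test.test_001 (id INT PRIMARY KEY, name VARCHAR(64));
CREATE TABLE test.test_002 (id INT, name VARCHAR(64));
INSERT INTO test.test_001 VALUES (1, 'alice'), (2, 'bob');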
Key Points:
Apache SeaTunnel focuses on data integration and synchronization, addressing common challenges such as: