When using SeaTunnel 2.3.9 to sync data from Oracle to Doris, you may encounter garbled characters, especially if the Oracle database uses the ASCII character set. Don't panic: this article walks you through why this happens and how to fix it.

🧠 Root Cause

The issue stems from how SeaTunnel reads data from Oracle. If Oracle uses a character set like ASCII and you sync to Doris (which expects UTF-8 or another compatible encoding), Chinese characters can become unreadable.

The key is to intercept the data when it is read from the Oracle ResultSet and decode it with the correct charset.

🔍 Understanding the SeaTunnel Reading Flow

Let's look at the SeaTunnel internals that handle JDBC data ingestion:

1. JdbcSourceFactory

This class:
- Loads your source configurations.
- Constructs JdbcSourceConfig and JdbcDialect.
- Creates a JdbcSource instance.

2. JdbcSource

This:
- Initializes a SourceSplitEnumerator to split the tasks.
- Creates a JdbcSourceReader to execute them.

3. JdbcSourceReader

Responsible for:
- Building the JdbcInputFormat.
- Repeatedly calling the pollNext() method to fetch data.

4. pollNext() Method

This method:
- Calls open() in JdbcInputFormat to prepare the PreparedStatement and ResultSet.
- Then calls nextRecord() to process the ResultSet and convert it to a SeaTunnelRow.

5. nextRecord() and the Encoding Problem

In JdbcInputFormat:
- The nextRecord() method calls toInternal() in JdbcRowConverter.
- The default implementation uses JdbcFieldTypeUtils.getString(rs, resultSetIndex).

💥 Problem: If the ResultSet contains Chinese text from a database whose character set is ASCII, getString() decodes the stored bytes with the wrong charset and returns garbled text.

✅ Solution Strategy

We need to know the source encoding (supplied here through a configuration property) and decode the raw bytes with it at the moment they are retrieved from the ResultSet.

Here's how to do it:

🛠 Implementation Steps

Step 1: Add Charset Parameters

In JdbcInputFormat, add:

private final Map<String, String> params;

In the constructor:

public JdbcInputFormat(JdbcSourceConfig config, Map<TablePath, CatalogTable> tables) {
    this.jdbcDialect = JdbcDialectLoader.load(config.getJdbcConnectionConfig().getUrl(), config.getCompatibleMode());
    this.chunkSplitter = ChunkSplitter.create(config);
    this.jdbcRowConverter = jdbcDialect.getRowConverter();
    this.tables = tables;
    this.params = config.getJdbcConnectionConfig().getProperties(); // <-- get charset info here
}

Step 2: Pass params to the Row Converter

In the nextRecord() method of JdbcInputFormat, update the method call to:

SeaTunnelRow seaTunnelRow = jdbcRowConverter.toInternal(resultSet, splitTableSchema, params);

For this call to compile, toInternal() must accept the extra params argument wherever it is declared, including the JdbcRowConverter interface and AbstractJdbcRowConverter.

Step 3: Add Encoding Method

In AbstractJdbcRowConverter, define:

public static String convertCharset(byte[] value, String charSet) {
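    // Needs imports for java.util.Arrays and java.io.UnsupportedEncodingException,
    // and assumes the class already has a logger named log.
    // NULL and empty columns are both returned as null.
    if (value == null || value.length == 0) {
        return null;
    }
    // Troubleshooting aid: this logs every value at INFO level, so remove it
    // or lower the level once the charset is confirmed.
    log.info("Value bytes: {}", Arrays.toString(value));
    try {
        // Decode the raw column bytes with the configured source charset.
        return new String(value, charSet);
    } catch (UnsupportedEncodingException e) {
        // The configured charset name is not supported by this JVM.
        throw new RuntimeException(e);
    }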
}

Step 4: Modify toInternal() for String Types

In AbstractJdbcRowConverter, update the STRING type handling like so:

case STRING:
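    if (params == null || params.isEmpty()) {
        // No extra properties configured: keep the default behavior.
        fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
    } else {
        String sourceCharset = params.get("sourceCharset");
        if ("GBK".equalsIgnoreCase(sourceCharset)) {
            // Read the raw bytes and decode them as GBK instead of relying on getString().
            fields[fieldIndex] = convertCharset(JdbcFieldTypeUtils.getBytes(rs, resultSetIndex), sourceCharset);
        } else {
            // Only GBK triggers the conversion; any other value uses the default path.
            fields[fieldIndex] = JdbcFieldTypeUtils.getString(rs, resultSetIndex);
        }
    }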
    break;

Step 5: Rebuild and Deploy

After making the above changes:
- Rebuild the connector-jdbc module.
- Replace the existing connector-jdbc-2.3.9.jar under SeaTunnel's connectors directory.
- Restart the SeaTunnel cluster.

🧾 Configuration Tips

- If your Oracle database does not have encoding issues, you don't need to pass the sourceCharset property.
- If needed, pass it like this in your config:

sourceCharset=GBK

- To inspect the log output from connector-jdbc, check the worker logs in the SeaTunnel logs directory.

✅ Summary

By adding a simple charset-switching mechanism and tweaking the JDBC source implementation, you can eliminate garbled characters when syncing Oracle data to Doris using SeaTunnel. No more broken characters: your data pipeline just got smarter. 🚀
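
The decoding idea behind Steps 3 and 4 can be tried outside SeaTunnel. The following is a minimal sketch, not the connector's code: the class CharsetRoundTrip and its methods are hypothetical, the stored bytes are assumed to be GBK, and ISO-8859-1 merely stands in for whatever wrong single-byte decoding produces the garbled text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-alone demo; none of these names exist in SeaTunnel.
class CharsetRoundTrip {

    // Mirrors the idea of Step 3: decode raw bytes with a named charset.
    static String decode(byte[] raw, String charsetName) {
        if (raw == null || raw.length == 0) {
            return null;
        }
        return new String(raw, Charset.forName(charsetName));
    }

    // Mirrors the branch in Step 4: only sourceCharset=GBK switches to byte decoding.
    static String readString(Map<String, String> params, byte[] raw, String driverDecoded) {
        if (params == null || params.isEmpty()) {
            return driverDecoded;
        }
        String sourceCharset = params.get("sourceCharset");
        if ("GBK".equalsIgnoreCase(sourceCharset)) {
            return decode(raw, sourceCharset);
        }
        return driverDecoded;
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the file encoding does not matter.
        String original = "\u4e2d\u6587";

        // Assumption: the database holds the GBK bytes of the text.
        byte[] stored = original.getBytes(Charset.forName("GBK"));

        // Assumption: a single-byte charset stands in for the wrong decoding.
        String garbled = new String(stored, StandardCharsets.ISO_8859_1);

        Map<String, String> params = Collections.singletonMap("sourceCharset", "GBK");
        String repaired = readString(params, stored, garbled);
        String untouched = readString(Collections.<String, String>emptyMap(), stored, garbled);

        System.out.println("repaired matches original:  " + original.equals(repaired));
        System.out.println("default path still garbled: " + !original.equals(untouched));
    }
}
```

The point of the sketch is that the fix never "repairs" an already garbled String; it goes back to the raw bytes and decodes them once with the right charset, which is why Step 4 calls getBytes() rather than getString() on the ResultSet.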

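As for where sourceCharset goes: Step 1 reads it from the JDBC connection properties, so it would plausibly sit in the properties block of the Jdbc source. The fragment below is only a sketch under that assumption; the URL, credentials, and query are placeholders, and sourceCharset is recognized only by the patched connector described above.

```hocon
source {
  Jdbc {
    # Placeholder connection details; replace with your own.
    url = "jdbc:oracle:thin:@//oracle-host:1521/ORCL"
    driver = "oracle.jdbc.OracleDriver"
    user = "your_user"
    password = "your_password"
    query = "SELECT * FROM your_schema.your_table"

    # Read by the patched JdbcInputFormat through getProperties().
    properties {
      sourceCharset = "GBK"
    }
  }
}
```
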
The code in this story is for educational purposes. The readers are solely responsible for whatever they build with it.


Fixing Garbled Text When Syncing Oracle to Doris with SeaTunnel 2.3.9
