Recently, the community published an article titled "Say Goodbye to Hand-Written Schemas! SeaTunnel's Integration with Gravitino Metadata REST API Is a Really Cool Move", which drew strong reactions from readers, with many saying, "This is really awesome!"

The contributor behind this feature is extremely proactive, and the feature is expected to be available soon (according to reliable sources, likely in version 3.0.0). To help the community better understand it, the contributor wrote a detailed article explaining the initial capabilities of the Gravitino REST API integration and how to use it. Let's take a closer look!

## 1. Background and Problems to Solve

When using Apache SeaTunnel for batch or sync tasks, if the source is unstructured or semi-structured, it usually requires an explicit schema definition (field names, types, order).

In real production environments, this leads to several typical issues:

- Tables have many fields and complex types, making manual schema maintenance costly and error-prone
- Upstream table structure changes (adding fields, changing types) require corresponding updates to SeaTunnel jobs
- For existing tables, simply syncing data still requires repeatedly describing metadata, leading to redundancy

Thus, the core question is:

> Can SeaTunnel directly reuse table structure definitions from an existing metadata system, instead of declaring the schema repeatedly in jobs?
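For context, this is the kind of hand-written schema block the feature removes. The field declaration syntax mirrors SeaTunnel's standard `schema.fields` form; the path and field names here are illustrative only:

```hocon
source {
  LocalFile {
    path = "/data/input"          # illustrative path
    file_format_type = "csv"
    # Every field must be declared by hand and kept in sync with upstream
    schema {
      fields {
        id        = bigint
        user_code = string
        age       = int
      }
    }
  }
}
```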
This feature was introduced to solve exactly that problem.

## 2. Introduction to Gravitino (Relevant Capabilities)

Gravitino is a unified metadata management and access service. It provides standardized REST APIs to manage and expose the following objects:

- Metalake (logical isolation unit)
- Catalogs (e.g., MySQL, Hive, Iceberg)
- Schema / Database
- Table and its field definitions

With Gravitino:

- Table structures can be centrally managed
- Downstream systems can dynamically fetch schema definitions via HTTP APIs
- There is no need to maintain field information in every compute or sync job

The new capability introduced in SeaTunnel is:

> Support for automatically pulling table structures via a `schema_url` pointing to Gravitino in the source schema definition.
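To make "dynamically fetch schema definitions via HTTP APIs" concrete, here is a minimal sketch of parsing column definitions out of a Gravitino table response. The JSON shape used below is an assumption for illustration; consult the Gravitino REST API documentation for the authoritative response schema:

```python
import json

def parse_columns(payload: dict) -> list:
    """Extract (name, type) pairs from a Gravitino table response.

    The payload shape here is an assumption for illustration, not the
    authoritative Gravitino contract.
    """
    table = payload.get("table", {})
    return [(c["name"], c["type"]) for c in table.get("columns", [])]

# Hypothetical response body, trimmed to the parts we care about
sample = json.loads("""
{
  "code": 0,
  "table": {
    "name": "demo_user",
    "columns": [
      {"name": "id", "type": "long", "nullable": false},
      {"name": "user_code", "type": "varchar(32)", "nullable": false}
    ]
  }
}
""")

print(parse_columns(sample))  # [('id', 'long'), ('user_code', 'varchar(32)')]
```

In a real job this JSON would come from an HTTP GET against the table's metadata URL; SeaTunnel performs that fetch internally when `schema_url` is configured.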
## 3. Local Test Environment Setup

### 3.1 Prepare MySQL Environment

#### 3.1.1 Create Target Table

Pre-create the target table `test.demo_user` in MySQL with the following SQL:

```sql
CREATE TABLE `demo_user` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `user_code` varchar(32) NOT NULL,
  `user_name` varchar(64) DEFAULT NULL,
  `password` varchar(128) DEFAULT NULL,
  `email` varchar(128) DEFAULT NULL,
  `phone` varchar(20) DEFAULT NULL,
  `gender` tinyint DEFAULT NULL,
  `age` int DEFAULT NULL,
  `status` tinyint DEFAULT NULL,
  `level` int DEFAULT NULL,
  `score` decimal(10,2) DEFAULT NULL,
  `balance` decimal(12,2) DEFAULT NULL,
  `is_deleted` tinyint DEFAULT NULL,
  `register_ip` varchar(45) DEFAULT NULL,
  `last_login_ip` varchar(45) DEFAULT NULL,
  `login_count` int DEFAULT NULL,
  `remark` varchar(255) DEFAULT NULL,
  `ext1` varchar(100) DEFAULT NULL,
  `ext2` varchar(100) DEFAULT NULL,
  `ext3` varchar(100) DEFAULT NULL,
  `ext4` varchar(100) DEFAULT NULL,
  `ext5` varchar(100) DEFAULT NULL,
  `created_by` varchar(64) DEFAULT NULL,
  `updated_by` varchar(64) DEFAULT NULL,
  `created_time` datetime DEFAULT NULL,
  `updated_time` datetime DEFAULT NULL,
  `birthday` date DEFAULT NULL,
  `last_login_time` datetime DEFAULT NULL,
  `version` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_user_code` (`user_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```

#### 3.1.2 Create the Table Schema to Sync

In practice, table structures might be managed centrally in components like Paimon, Hive, or Hudi. For this test, the table schema points to the target table `test.demo_user` created in the previous step.

### 3.2 Register the Table Schema in Gravitino

Gravitino supports direct database connections and scans all tables in a database. This table is managed in Gravitino as a table under the `local-mysql` catalog.

- Metalake: `test_Metalake`

### 3.3 Table Structure Access Explanation

Table structures in Gravitino can be accessed via the REST API:

```
http://localhost:8090/api/metalakes/test_Metalake/catalogs/${catalog}/schemas/${schema}/tables/${table}
```

In this test, the actual `schema_url` used is:

```
http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user
```

The returned JSON contains the complete field definitions of the `demo_user` table.

### 3.4 Local Deployment of SeaTunnel

Since this feature hasn't been officially released yet, you need to manually compile the latest `dev` branch and deploy it locally.
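The REST path pattern from section 3.3 can be assembled from its four components; a small sketch, using the values from this test setup:

```python
# Build the Gravitino table-metadata URL from its path components.
# Host and port mirror the local test setup described in this article.
BASE = "http://localhost:8090/api/metalakes"

def schema_url(metalake: str, catalog: str, schema: str, table: str) -> str:
    return f"{BASE}/{metalake}/catalogs/{catalog}/schemas/{schema}/tables/{table}"

url = schema_url("test_Metalake", "local-mysql", "test", "demo_user")
print(url)
```

The resulting string is exactly the `schema_url` value used in the job configuration below.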
### 3.5 Prepare Data Files

This test case uses a CSV file containing 2,000 records.

## 4. SeaTunnel Job Configuration

### 4.1 Core Configuration Example

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  LocalFile {
    path = "/Users/wangxuepeng/Desktop/seatunnel/apache-seatunnel-2.3.13-SNAPSHOT/test_data"
    file_format_type = "csv"
    schema {
      schema_url = "http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user"
    }
  }
}

sink {
  jdbc {
    url = "jdbc:mysql://localhost:3306/test"
    driver = "com.mysql.cj.jdbc.Driver"
    username = "root"
    password = "123456"
    database = "test"
    table = "demo_user"
    generate_sink_sql = true
  }
}
```

### 4.2 Key Configuration Notes

- `schema.schema_url`
  - Points to the table metadata REST API in Gravitino
  - SeaTunnel automatically fetches the table schema at job start
  - No need to manually declare field lists in jobs
- `generate_sink_sql = true`
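The 2,000-row test CSV from section 3.5 can be produced with a short script. This sketch is trimmed to a few synthetic columns for brevity; a real test file must cover every column of `demo_user`, in schema order:

```python
import csv
import io

# Generate synthetic rows for a few demo_user-like columns
# (id, user_code, user_name, email). Values are made up.
def make_rows(n: int):
    for i in range(1, n + 1):
        yield [i, f"U{i:05d}", f"user_{i}", f"user_{i}@example.com"]

buf = io.StringIO()
writer = csv.writer(buf)
for row in make_rows(2000):
    writer.writerow(row)

lines = buf.getvalue().splitlines()
print(len(lines))   # 2000
print(lines[0])     # 1,U00001,user_1,user_1@example.com
```

Note that the 4.1 configuration does not set `skip_header_row_number`, so the file is written without a header row.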
  - The sink automatically generates INSERT SQL based on the parsed schema

## 5. Data and Job Execution Results

Log screenshot:

During job execution:

- The source automatically parses the field structure via `schema_url`
- CSV fields automatically align with the table schema
- Data is successfully written to the MySQL `demo_user` table

## 6. FAQ

### 6.1 Supported Connectors

Currently, the `dev` branch supports file-type connectors, including `local`, `hdfs`, `s3`, etc.

### 6.2 Does `schema_url` support multiple tables?
The feature does not affect multi-table functionality and can be used in combination with it, e.g.:

```hocon
source {
  LocalFile {
    tables_configs = [
      {
        path = "/seatunnel/read/metalake/table1"
        file_format_type = "csv"
        field_delimiter = ","
        row_delimiter = "\n"
        skip_header_row_number = 1
        schema {
          table = "db.table1"
          fields {
            c_string = string
            c_int = int
            c_boolean = boolean
            c_double = double
          }
        }
      },
      {
        path = "/seatunnel/read/metalake/table2"
        file_format_type = "csv"
        field_delimiter = ","
        row_delimiter = "\n"
        skip_header_row_number = 1
        schema {
          table = "db.table2"
          schema_url = "http://gravitino:8090/api/metalakes/test_metalake/catalogs/test_catalog/schemas/test_schema/tables/table2"
        }
      }
    ]
  }
}
```
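The mixed mode above (one table with inline `fields`, one with `schema_url`) can be thought of as a per-table schema-resolution step. A minimal sketch, with a stubbed fetch standing in for the HTTP call and with the precedence order being an assumption, not SeaTunnel's actual implementation:

```python
def resolve_schema(table_config: dict, fetch) -> dict:
    """Return {field: type} for one table config: inline `fields` are
    used when present, otherwise the schema is pulled via `schema_url`.
    `fetch` stands in for the HTTP call SeaTunnel performs internally."""
    schema = table_config["schema"]
    if "fields" in schema:
        return dict(schema["fields"])
    return fetch(schema["schema_url"])

# Stubbed metadata-service lookup (hypothetical response)
def fake_fetch(url):
    return {"c_string": "string", "c_long": "bigint"}

inline = {"schema": {"fields": {"c_string": "string", "c_int": "int"}}}
remote = {"schema": {"schema_url": "http://gravitino:8090/..."}}

print(resolve_schema(inline, fake_fetch))  # {'c_string': 'string', 'c_int': 'int'}
print(resolve_schema(remote, fake_fetch))  # {'c_string': 'string', 'c_long': 'bigint'}
```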
## 7. Feature Summary

By introducing Gravitino `schema_url`–based automatic schema parsing, SeaTunnel gains the following advantages in data sync scenarios:

- Eliminates repeated schema definitions, reducing job configuration complexity
- Reuses a unified metadata management system, improving consistency
- Handles table structure changes gracefully, significantly lowering maintenance costs

This feature is ideal for:

- Enterprises with mature metadata platforms
- Large tables with many fields or frequent schema changes
- Users seeking improved maintainability of SeaTunnel jobs

## 8. References

- Code PR: https://github.com/apache/seatunnel/pull/10402
- `schema_url` configuration docs: https://seatunnel.apache.org/zh-CN/docs/introduction/concepts/schema-feature#schema_url