Final Project Report 2 | Apache SeaTunnel Adds Metalake Support

Written by williamguo | Published 2025/11/20

TL;DR: Sensitive credentials are no longer hard-coded in task scripts. Connect securely through a centralized metadata service such as Apache Gravitino for dynamic data source management.

Over the past two weeks, we’ve conducted brief interviews with several outstanding student developers from the Summer of Open Source program to learn about their development experiences and insights.

Today, we’re sharing the full project report for one of the most exciting contributions — Metalake support in Apache SeaTunnel — to help the community better understand its technical design and latest progress.

I. Project Background

Currently, in Apache SeaTunnel’s task configuration, sensitive information such as database usernames and passwords is hard-coded into task scripts. This approach introduces several problems:

  1. Security Risks: Sensitive information is exposed in scripts, making data source credentials vulnerable to leaks.
  2. Maintenance Overhead: When data source configurations change, users must manually update all related task scripts, which is inefficient and error-prone.

To address these issues, this project introduces Metalake integration to centralize data source configuration management.

Through a data source ID mapping mechanism, users can easily update and manage connection information. The goal is to support the Apache Gravitino metadata catalog and reserve interfaces for future integration with other third-party metadata services.

Example REST API for retrieving Gravitino catalog info:

https://gravitino.apache.org/docs/0.9.0-incubating/api/rest/load-catalog

Project repository:

https://github.com/apache/seatunnel

Main implementation objectives:

  1. Adapt Metalake configuration loading: load Metalake-related configuration from seatunnel-env.sh when a task starts.
  2. Refactor source and sink configuration logic: add sourceId for querying Metalake and replacing configuration placeholders dynamically.
  3. Plugin-based Metalake support integrated with Apache Gravitino: define a unified Metalake interface, enable Gravitino support, and keep the design easily extensible to future metadata catalogs.

II. Solution Overview

1. Metalake Configuration Adaptation

Goal: Load Metalake configuration during task startup.

Method: Define Metalake settings in seatunnel-env.sh or directly in the task configuration file.

Example in seatunnel-env.sh:

METALAKE_ENABLED=true
METALAKE_TYPE=gravitino
METALAKE_URL=http://localhost:8090/api/metalakes/metalake_name/catalogs/

Or within a task configuration:

env {
  metalake_enabled = true
  metalake_type = "gravitino"
  metalake_url = "http://localhost:8090/api/metalakes/metalake_name/catalogs/"
}

If the configuration exists in the task file, it is loaded automatically.

If it is defined in seatunnel-env.sh, it can be read via System.getenv() at runtime, provided the variables are exported into the environment of the SeaTunnel JVM process.
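
To make the two loading paths concrete, here is a minimal Java sketch of how the three settings might be resolved. The class name MetalakeSettings and its load method are hypothetical and not taken from the SeaTunnel codebase. The precedence (task file first, environment second) is also an assumption, since the report does not state which source wins.

```java
import java.util.Map;

/** Hypothetical holder for the three Metalake settings shown above. */
final class MetalakeSettings {
    final boolean enabled;
    final String type; // e.g. "gravitino"; null when not configured
    final String url;  // null when not configured

    MetalakeSettings(boolean enabled, String type, String url) {
        this.enabled = enabled;
        this.type = type;
        this.url = url;
    }

    /**
     * Resolves each setting from the task's env block first and falls back
     * to the process environment. The precedence is an assumption.
     */
    static MetalakeSettings load(Map<String, String> taskEnv, Map<String, String> processEnv) {
        String enabled = pick(taskEnv, "metalake_enabled", processEnv, "METALAKE_ENABLED");
        String type = pick(taskEnv, "metalake_type", processEnv, "METALAKE_TYPE");
        String url = pick(taskEnv, "metalake_url", processEnv, "METALAKE_URL");
        // Boolean.parseBoolean(null) is false, so Metalake stays off by default.
        return new MetalakeSettings(Boolean.parseBoolean(enabled), type, url);
    }

    /** Returns the task-file value if present, otherwise the environment value (may be null). */
    private static String pick(Map<String, String> taskEnv, String taskKey,
                               Map<String, String> processEnv, String envKey) {
        String value = taskEnv.get(taskKey);
        return value != null ? value : processEnv.get(envKey);
    }
}
```

At startup this could be called as MetalakeSettings.load(taskEnvMap, System.getenv()), where taskEnvMap stands for the parsed env block of the task file.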

2. Refactoring Source/Sink Configuration

2.1 Add sourceId to Source/Sink

Goal: Identify data sources in Metalake.

Example:

source {
  type = "mysql"
  sourceId = "mysql_datasource_001"
  url = "jdbc:mysql://localhost:3306/db"
  ...
}

2.2 Support Placeholder Replacement

Goal: Dynamically fetch credentials and replace placeholders using Metalake.

Method:

  • Detect the metalake_enabled flag and a sourceId during configuration parsing.
  • Query Metalake via REST API and replace placeholders like ${username} or ${password}.
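
The replacement step above can be sketched in Java as follows. This is an illustration under stated assumptions, not the merged SeaTunnel code: PlaceholderResolver is a hypothetical name, the Metalake response is assumed to have already been flattened into a key/value map, and leaving unknown placeholders untouched is a design choice made here for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that substitutes ${key} placeholders in a source/sink config. */
final class PlaceholderResolver {
    // Matches ${name} and captures the name between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Returns a copy of the config with every known placeholder replaced. */
    static Map<String, String> resolve(Map<String, String> config, Map<String, String> metalakeProps) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : config.entrySet()) {
            resolved.put(entry.getKey(), resolveValue(entry.getValue(), metalakeProps));
        }
        return resolved;
    }

    /** Replaces placeholders in one value; unknown placeholders are left as-is. */
    static String resolveValue(String value, Map<String, String> metalakeProps) {
        Matcher matcher = PLACEHOLDER.matcher(value);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String replacement = metalakeProps.get(matcher.group(1));
            String text = replacement != null ? replacement : matcher.group();
            // quoteReplacement keeps '$' and '\' in passwords from being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

With metalakeProps containing username and password, a config value such as "${username}" becomes the stored user name, while values without placeholders (for example the JDBC URL) pass through unchanged.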

3. Plugin-Based Metalake and Gravitino Integration

3.1 Define Metalake Interface

Create a MetalakeClient interface providing methods for data source lookup.
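
The report does not reproduce the interface itself, so the following is only a sketch of what such a contract might look like. The method names type and getDataSourceProperties are assumptions, and InMemoryMetalakeClient is a hypothetical stand-in that is handy for unit tests without a running metadata service.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Sketch of a lookup contract; the method names are assumptions. */
interface MetalakeClient {
    /** Identifier matched against metalake_type, e.g. "gravitino". */
    String type();

    /** Returns the connection properties registered under the given sourceId. */
    Map<String, String> getDataSourceProperties(String sourceId);
}

/** Hypothetical in-memory implementation, useful as a test double. */
final class InMemoryMetalakeClient implements MetalakeClient {
    private final Map<String, Map<String, String>> sources = new HashMap<>();

    /** Registers a data source and returns this client for chaining. */
    InMemoryMetalakeClient register(String sourceId, Map<String, String> properties) {
        sources.put(sourceId, Map.copyOf(properties));
        return this;
    }

    @Override
    public String type() {
        return "in-memory";
    }

    @Override
    public Map<String, String> getDataSourceProperties(String sourceId) {
        Map<String, String> properties = sources.get(sourceId);
        if (properties == null) {
            throw new NoSuchElementException("Unknown sourceId: " + sourceId);
        }
        return properties;
    }
}
```

Keeping the contract this small means a new metadata catalog only has to answer one question: which properties belong to a given sourceId.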

3.2 Implement Apache Gravitino Client

Implement GravitinoClient based on the interface:

  • Use an HTTP client to call the Gravitino REST API.
  • Parse and map data source info to SeaTunnel configuration placeholders.
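
As an illustration of the HTTP side only, the sketch below builds and sends the load-catalog request with the JDK's java.net.http client. Several things here are assumptions rather than facts from the report: the class name GravitinoCatalogFetcher, the idea that sourceId is appended to metalake_url as the catalog name, and the Accept header value, which follows the curl examples in the Gravitino documentation as far as I can tell. JSON parsing is deliberately left out; a real client would hand the body to a JSON library and map the catalog properties to placeholders.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/** Hypothetical fetcher for the Gravitino load-catalog endpoint. */
final class GravitinoCatalogFetcher {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String baseUrl;

    /** metalakeUrl is the metalake_url setting, e.g. ".../metalakes/metalake_name/catalogs/". */
    GravitinoCatalogFetcher(String metalakeUrl) {
        this.baseUrl = metalakeUrl.endsWith("/") ? metalakeUrl : metalakeUrl + "/";
    }

    /** Builds the GET request; assumes sourceId is the catalog name. */
    HttpRequest buildRequest(String sourceId) {
        // URLEncoder targets form encoding, so spaces are fixed up for use in a path.
        String encoded = URLEncoder.encode(sourceId, StandardCharsets.UTF_8).replace("+", "%20");
        return HttpRequest.newBuilder(URI.create(baseUrl + encoded))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/vnd.gravitino.v1+json") // assumed media type
                .GET()
                .build();
    }

    /** Sends the request and returns the raw JSON body, failing on non-200 responses. */
    String fetchCatalogJson(String sourceId) throws IOException, InterruptedException {
        HttpResponse<String> response =
                http.send(buildRequest(sourceId), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException(
                    "Gravitino returned HTTP " + response.statusCode() + " for " + sourceId);
        }
        return response.body();
    }
}
```

Separating buildRequest from fetchCatalogJson keeps the URL and header logic testable without a running Gravitino server.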

3.3 Extensible Plugin Design

Add a factory mechanism to select client types dynamically (e.g., Gravitino, UnityCatalog, or DataHub).
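
One way such a factory could look is sketched below, using a simple registry keyed by the metalake_type value. The names MetalakeClientFactory, register, and create are hypothetical, and the single-method MetalakeClient interface is redeclared here only to keep the example self-contained. The actual project may discover implementations differently, for instance through Java's ServiceLoader.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Minimal lookup contract, redeclared so this sketch compiles on its own. */
interface MetalakeClient {
    Map<String, String> getDataSourceProperties(String sourceId);
}

/** Hypothetical registry-based factory keyed by the metalake_type setting. */
final class MetalakeClientFactory {
    // Maps a lower-cased type name to a constructor that takes the metalake URL.
    private static final Map<String, Function<String, MetalakeClient>> REGISTRY =
            new ConcurrentHashMap<>();

    private MetalakeClientFactory() {}

    /** Registers a client constructor under a type name such as "gravitino". */
    static void register(String type, Function<String, MetalakeClient> constructor) {
        REGISTRY.put(type.toLowerCase(Locale.ROOT), constructor);
    }

    /** Creates a client for the configured type, or fails with a clear message. */
    static MetalakeClient create(String type, String url) {
        Function<String, MetalakeClient> constructor =
                REGISTRY.get(type.toLowerCase(Locale.ROOT));
        if (constructor == null) {
            throw new IllegalArgumentException("Unsupported metalake_type: " + type);
        }
        return constructor.apply(url);
    }
}
```

Adding support for another catalog then amounts to one extra register call, with no change to the code that resolves placeholders.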

3.4 Backward Compatibility

Ensure existing tasks are unaffected:

  • metalake_enabled defaults to false.
  • Metalake logic is triggered only when it is explicitly enabled and a sourceId is provided.
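
The two conditions above reduce to a small guard. The sketch below is illustrative only: MetalakeGuard is a hypothetical name, and both the env block and the source/sink block are assumed to be available as plain string maps.

```java
import java.util.Map;

/** Hypothetical guard deciding whether Metalake resolution should run for one plugin. */
final class MetalakeGuard {
    private MetalakeGuard() {}

    /**
     * Returns true only when Metalake is explicitly enabled and the plugin
     * block declares a non-blank sourceId; otherwise the config is used as written.
     */
    static boolean shouldUseMetalake(Map<String, String> env, Map<String, String> pluginConfig) {
        // A missing key falls back to "false", so existing tasks are unaffected.
        boolean enabled = Boolean.parseBoolean(env.getOrDefault("metalake_enabled", "false"));
        String sourceId = pluginConfig.get("sourceId");
        return enabled && sourceId != null && !sourceId.isBlank();
    }
}
```

Because both checks must pass, a task that never mentions Metalake takes exactly the same code path as before the feature existed.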

III. Project Timeline

Timeframe: July 1, 2025 – September 30, 2025

Below is the detailed implementation plan and milestones for this project.

| Phase | Time | Tasks | Milestones |
| --- | --- | --- | --- |
| Preparation Phase | July 1 – July 7, 2025 | Finalize technical solution details; set up development environment; complete seatunnel-env.sh configuration file format design | Technical solution confirmed and development environment prepared |
| Development Phase 1: Metalake Configuration Adaptation | July 8 – July 20, 2025 | Implement configuration read and load functions; integrate configuration loading into task context; test configuration load functionality | Metalake configuration and loading functions completed and passed unit testing |
| Development Phase 2: Source/Sink Refactoring | July 21 – August 5, 2025 | Add sourceId to source and sink configuration; implement field mapping logic; test data source replacement logic | Source/Sink configuration refactoring completed and passed integration testing |
| Development Phase 3: Plugin Support and Gravitino Integration | August 6 – August 31, 2025 | Define MetalakeClient interface; implement Gravitino client integration; support plugin method; verify backward compatibility | Gravitino integration and plugin support completed, extensibility verified |
| Testing & Optimization Phase | September 1 – September 15, 2025 | Conduct comprehensive functional testing; fix bugs and optimize code; compile project documentation | All functional testing completed; final code and documentation submitted |
| Summary & Submission Phase | September 16 – September 30, 2025 | Summarize project deliverables; submit code to Apache SeaTunnel community; prepare project report | Project officially completed and accepted |

IV. Project Progress

Completed Work

All core features have been developed, tested, and merged into the main repository.

Challenges and Solutions

Thanks to guidance from my mentor, liugddx, most of the challenges I met while coding were minor.

The main difficulty was the lengthy test suite: SeaTunnel’s integration tests are extensive and sometimes unstable due to network factors, requiring multiple retries.

This process tested my patience and attention to detail.

Test Case Design

A sample task configuration was created using Metalake-based MySQL as the source and Assert as the sink to validate correctness.

Integration tests were built and successfully passed in GitHub CI.

Future Work

Future improvements include extending support for more Metalake types beyond Apache Gravitino, enabling wider metadata interoperability.


Written by williamguo | William Guo, WhaleOps CEO, Apache Software Foundation Member
Published by HackerNoon on 2025/11/20