What the Heck Is Apache Paimon?

by Shawn Gordon, December 5th, 2023

Too Long; Didn't Read

Paimon can be thought of as a table format for batch and stream processing. The use case seems to be tightly coupled to Flink at the moment by giving you a more flexible and persistent storage layer for your streaming data. There is generally a lot of data coming in, and you probably need to be able to act on it quickly.

Introduction

You’ve heard of data warehouses, you’ve probably heard of data lakes and the data lakehouse, but have you heard of the “Streamhouse”? Well, that’s what “they” call the new capability that Apache Paimon makes possible. So, just what the heck is it? Let’s find out!

Ververica, the company founded by the original creators of Apache Flink, looks pretty excited about Paimon, but it was primarily created by Jingsong Lee at Alibaba. The initial purpose was to address the inability to query a Flink Dynamic Table directly, which led the community to propose FLIP-188: Introduce Built-in Dynamic Table Storage.

Paimon can be thought of as a table format for batch and stream processing. I’ve written a couple of times about the big three table formats here and here.

Paimon Overview

The short version of what Paimon does is that it provides a streaming storage layer and extends Flink's capability to do stream processing directly on the data lake.

The use case seems to be tightly coupled to Flink at the moment by giving you a more flexible and persistent storage layer for your streaming data.

Instead of using the ‘InMemory Catalog’, you would set up a ‘Paimon Catalog’ like other table formats. That would look something like this:

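-- Register a catalog backed by Paimon; option keys are lowercase and case-sensitive
CREATE CATALOG paimon WITH (
   'type' = 'paimon',
   'warehouse' = '<path to your warehouse>'
);
-- Catalog names are case-sensitive, so this must match the name used above
USE CATALOG paimon;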
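
Once a Paimon catalog is the current catalog, tables created in it are stored in the warehouse path rather than in memory. Here is a minimal sketch of what that might look like; the table name `page_views` and its columns are hypothetical, picked only for illustration:

```sql
-- Assumes a Paimon catalog has already been created and selected with USE CATALOG.
-- 'page_views' and its columns are made-up names for this sketch.
CREATE TABLE IF NOT EXISTS page_views (
    user_id    BIGINT,
    page       STRING,
    view_count BIGINT,
    -- Flink requires NOT ENFORCED on primary keys; Paimon uses the key to merge rows
    PRIMARY KEY (user_id, page) NOT ENFORCED
);

-- Write a couple of rows; a later write with the same key replaces the earlier row
-- under Paimon's default merge behavior for primary-key tables.
INSERT INTO page_views VALUES (1, 'home', 1), (2, 'pricing', 3);
```

Because the table lives in the warehouse, it is still there after the Flink session ends, which is the “persistent storage layer” part of the story.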

To explain the flow, I’m going to borrow a screenshot from the excellent blog by Giannis Polyzos.

If you aren’t familiar with what an LSM (Log-Structured Merge) tree is, then I suggest this blog by Vishal Rana. I think he did a nice job explaining it.

So, Paimon can be used to replace message queues, giving Flink a storage layer in the form of a table format. As a streaming data lake platform, it allows users to process data in both batch and streaming modes, supporting:

High-speed data ingestion
Change data tracking
Real-time analytics
High throughput data writing
Low-latency data queries
Batch writes and reads
Streaming updates
Changelog producing
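
To make the “both batch and streaming modes” point concrete, here is a rough sketch of reading the same table two ways from the Flink SQL client. The table `rides` and its columns are hypothetical, and I’m assuming a Paimon catalog is already the current catalog:

```sql
-- 'rides' is a made-up table for this sketch.
CREATE TABLE IF NOT EXISTS rides (
    ride_id BIGINT,
    city    STRING,
    fare    DOUBLE,
    PRIMARY KEY (ride_id) NOT ENFORCED
) WITH (
    -- Ask Paimon to produce a changelog for streaming readers
    'changelog-producer' = 'lookup'
);

-- Batch mode: a bounded query over the current snapshot that runs and finishes
SET 'execution.runtime-mode' = 'batch';
SELECT city, SUM(fare) AS total_fare FROM rides GROUP BY city;

-- Streaming mode: a continuous query that keeps emitting changes as they arrive
SET 'execution.runtime-mode' = 'streaming';
SELECT * FROM rides /*+ OPTIONS('scan.mode' = 'latest') */;
```

The batch query behaves like a normal lakehouse read, while the streaming query behaves more like subscribing to a message queue, which is why the two use cases can share one table.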

Where It Applies

The first few use cases that occurred to me were gaming, IoT devices, stock trading, and services like Uber. There is generally a lot of data coming in quickly, and you probably need to be able to act on it just as quickly.

I remember having to deal with combining streaming data from ad impressions and ad clicks to get insights on advertising data as another example.

Having this kind of a materialized view and a table format defined on it would certainly have been helpful. I bet you can think of some use cases that would help you out.

Borrowing another screenshot from Giannis Polyzos, we see how we are getting to what is being called a “Streamhouse.” I found this very interesting after my time in the PrestoDB (compute engine) and Apache Iceberg (table format) communities.

Streaming would come up fairly often with Iceberg, and there is an argument out there that the Apache Hudi table format is better for that use case.

Paimon can also address some issues with Change Data Capture (CDC) ingestion into a data lake. It simplifies the CDC pipeline, supporting synchronizing CDC data with schema changes, streaming changelog tracking, and a partial-update merge engine.
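
The partial-update merge engine is the piece that would have helped with my ad impressions and clicks problem. Here is a minimal sketch, assuming a Paimon catalog is the current catalog; the table `ad_stats` and its columns are hypothetical:

```sql
-- 'ad_stats' is a made-up table for this sketch.
CREATE TABLE IF NOT EXISTS ad_stats (
    ad_id       BIGINT,
    impressions BIGINT,
    clicks      BIGINT,
    PRIMARY KEY (ad_id) NOT ENFORCED
) WITH (
    -- Rows with the same key are merged column by column; NULLs do not overwrite
    'merge-engine' = 'partial-update'
);

-- The impressions pipeline only knows the impressions column
INSERT INTO ad_stats VALUES (42, 1000, CAST(NULL AS BIGINT));

-- The clicks pipeline only knows the clicks column
INSERT INTO ad_stats VALUES (42, CAST(NULL AS BIGINT), 37);

-- A batch read now sees one merged row per ad_id carrying both values
SET 'execution.runtime-mode' = 'batch';
SELECT * FROM ad_stats WHERE ad_id = 42;
```

Each pipeline writes only the columns it owns, and the table format stitches them together by primary key, so there is no need for a separate streaming join just to get both numbers into one row.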

Summary

As I researched for this blog, I thought Paimon and SeaTunnel would make an exciting combination for a real-time data warehouse. Then I ran across this blog, “Apache SeaTunnel and Paimon: Unleashing the Potential of Real-Time Data Warehousing” from back in May 2023, so clearly, I’m not the first to think about this idea.

With these new projects, we’re seeing an interesting shift in the streaming and real-time space, and I’m here for it. I wish I personally had a use case to try these things out on a big, live system.

We’re seeing more and more clever technology being developed to address new problems that arise. I wonder how the Amazon S3 Express One Zone Storage Class announcement might impact the “real-time” space. It’s all very exciting to watch.

You can read more "What the heck" articles at the following links:

What The Heck Is DuckDB?

What the Heck Is Malloy?

What the Heck is PRQL?

What the Heck is GlareDB?

What the Heck is SeaTunnel?

What the Heck is LanceDB?

What the Heck is SDF?