You’ve heard of data warehouses, you’ve probably heard of data lakes and the data lakehouse, but have you heard of the “Streamhouse”? Well, that’s what “they” call the new data tech unlock made possible with Apache Paimon.
A company named Ververica, the original creator of Apache Flink, is behind the term.
Paimon can be thought of as a table format for batch and stream processing. I’ve written a couple of times about the big three table formats (Iceberg, Hudi, and Delta Lake).
The short version of what Paimon does is that it provides a streaming storage layer and extends Flink's capability to do stream processing directly on the data lake.
The use case seems tightly coupled to Flink at the moment: Paimon gives Flink a more flexible, persistent storage layer for your streaming data.
Instead of using Flink’s default in-memory catalog, you would set up a Paimon catalog, just as you would for other table formats. That would look something like this:
CREATE CATALOG paimon WITH (
    'type' = 'paimon',
    'warehouse' = '<path to your warehouse>'
);

USE CATALOG paimon;
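From there, creating and writing tables looks like ordinary Flink SQL. Here’s a rough sketch, assuming a primary-keyed table; the table and column names (and the `source_stream` table) are hypothetical, and exact options vary by Paimon version:

```sql
-- Hypothetical example: a primary-keyed Paimon table for streaming upserts.
CREATE TABLE user_events (
    user_id   BIGINT,
    event_ts  TIMESTAMP(3),
    payload   STRING,
    PRIMARY KEY (user_id) NOT ENFORCED
);

-- Writes land as files under the warehouse path, and downstream jobs
-- can then read the table in either batch or streaming mode.
INSERT INTO user_events
SELECT user_id, event_ts, payload FROM source_stream;
```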
To explain the flow, I’m going to borrow a screenshot from the Apache Paimon documentation.
If you aren’t familiar with what an LSM Tree (Log-Structured Merge tree) is, then I suggest reading up on it first, since Paimon’s storage layer is built on one.
So, Paimon can be used to replace message queues, providing a storage layer for Flink that is a table format. As a streaming data lake platform, it allows users to process data in both batch and streaming modes.
The first few use cases that occurred to me were gaming, IoT devices, stock trading, and services like Uber: lots of data arriving quickly, where you need to act on it fast. As another example, I remember having to combine streaming ad-impression and ad-click data to derive advertising insights.
Having that kind of materialized view, with a table format defined on it, would certainly have been helpful. I bet you can think of some use cases that would help you out.
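To make the ad example concrete, here’s a sketch of the kind of continuously maintained result this enables; the table and column names are hypothetical, and the `impressions` and `clicks` source tables are assumed to already exist:

```sql
-- Hypothetical: merge click and impression streams into one table that
-- stays queryable, instead of re-deriving the join for every query.
CREATE TABLE ad_stats (
    ad_id        BIGINT,
    impressions  BIGINT,
    clicks       BIGINT,
    PRIMARY KEY (ad_id) NOT ENFORCED
);

INSERT INTO ad_stats
SELECT i.ad_id,
       COUNT(i.impression_id),
       COUNT(c.click_id)  -- counts only matched (non-NULL) clicks
FROM impressions i
LEFT JOIN clicks c ON i.impression_id = c.impression_id
GROUP BY i.ad_id;
```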
Borrowing another screenshot from Giannis Polyzos, we see how we are getting to what is being called a “Streamhouse.” I found this very interesting after my time in the Iceberg community.
Streaming would come up fairly often with Iceberg, and there is an argument out there that the Apache Hudi table format is better for that use case.
Paimon can also ease some of the pain of Change Data Capture (CDC) ingestion into a data lake. It simplifies the CDC pipeline, supporting synchronizing CDC data with schema changes, streaming changelog tracking, and a partial-update merge engine.
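The partial-update merge engine is configured as a table option. A minimal sketch, assuming hypothetical table and column names (check the Paimon docs for your version’s exact options):

```sql
-- With 'partial-update', rows sharing a primary key are merged:
-- non-NULL incoming columns overwrite the stored values, so different
-- streams can each fill in their own columns of the same row.
CREATE TABLE orders (
    order_id  BIGINT,
    status    STRING,
    amount    DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'merge-engine' = 'partial-update'
);
```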
As I researched for this blog, I thought Paimon and SeaTunnel would make an exciting combination for a real-time data warehouse. Then I ran across a blog exploring that very combination.
With these new projects, we’re seeing an interesting shift in the streaming and real-time space, and I’m here for it. I wish I personally had a use case to try these things out on a big, live system.
We’re seeing more and more clever technology being developed to address new problems that arise. I wonder how the Amazon S3 Express One Zone Storage Class announcement might impact the “real-time” space. It’s all very exciting to watch.
You can read more "What the heck" articles at the following links: