The Data Table Format Wars

Written by progrockrec | Published 2022/07/04
Tech Story Tags: apache-hudi | databricks | iceberg | software-engineering | datatable | s3 | cloud-storage | data-table-format

TL;DR: If you are considering a data lake, you're going to want to know about the table formats that give you database-like functionality on top of it. I talk about those here.

As I write this on June 29, 2022, the “Data + AI Summit” is on its last day in San Francisco. I’d been thinking about writing on this topic for nearly a year, but it was the announcement from Databricks that it is open sourcing all Delta Lake APIs as part of the Delta Lake 2.0 release that finally prompted me. Databricks also announced that it will be contributing all enhancements of Delta Lake to The Linux Foundation. That brings us to the “Table Format Wars” subject of this article, where we will look at Delta Lake, Apache Hudi (created at Uber), and Apache Iceberg (created at Netflix). For the sake of this article, I’m going to assume you know what the data lakehouse is and how we got here. If you are unfamiliar, this is a great article to catch you up.

Background

In the long long ago, in the before time, we had databases, where compute, storage, security, indexing, and all that good stuff lived in one place. As computing needs advanced at a crazy rate, software was developed to address them. We started dumping structured, semi-structured, and unstructured data (audio, video, images) into storage like AWS S3, and we needed a way to understand the structure of what was there. This gave rise to catalogs such as the Hive Metastore or AWS Glue that would describe the data. That was a start, but we also needed to work with the data in those files as we would in a database, inserting, updating, and deleting records, and that gave rise to the Hive table format. It was quite primitive compared to the options available today and, if I recall correctly, didn’t work with the metastore yet, so users had to know the layout of the files.
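To make the catalog idea concrete, here is a toy sketch in plain Python (not the real Hive Metastore or Glue API; the table name, schema, and bucket path are all made up): a catalog simply maps a table name to a schema and the storage paths holding its data files, so query engines don't need to know the file layout themselves.

```python
# Toy catalog: table name -> schema + data file locations.
# This is a conceptual sketch, not the Hive Metastore / AWS Glue API.
catalog = {}

def register_table(name, schema, paths):
    """Record what columns a table has and where its files live."""
    catalog[name] = {"schema": schema, "paths": paths}

def describe(name):
    """What a query engine asks the catalog before planning a query."""
    entry = catalog[name]
    return entry["schema"], entry["paths"]

# Hypothetical table and bucket, purely for illustration.
register_table(
    "events",
    schema={"user_id": "bigint", "ts": "timestamp", "payload": "string"},
    paths=["s3://my-bucket/events/part-0000.parquet"],
)

schema, paths = describe("events")
```

With only this much, an engine can present raw files as a queryable schema; what the catalog alone cannot do is handle inserts, updates, and deletes, which is where table formats come in.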

Today

Databricks built and released Delta Lake as a table format in April 2019, but the open-source version was pretty limited, and Databricks was really the only company doing any serious work on it. That’s what made their announcement about open-sourcing a bunch of stuff so interesting. I read it as a bit of a panic move, as Iceberg is getting so much vendor adoption and so many vendor contributors. Iceberg is also the newest kid on the block, becoming a top-level Apache project in May 2020. Hudi started life at Uber in 2016 and was open-sourced in 2017, and is used at some dang big companies other than Uber, like the Robinhood trading platform, Amazon, ByteDance (creators of TikTok), and many others.

Features

The way these table formats handle upserts, deletes, etc., is generally one of two methods:

  • Copy on Write (CoW)
  • Merge on Read (MoR)

Keep in mind that these changes to the files arrive as change logs, which means the latest versions need to be resolved for queries, or for time travel. Time travel is a cool feature of this configuration, but I’m not going to address it in this article. MoR tends to make writes faster than CoW at the cost of more work at read time, but this is a pretty detailed topic that you should research in depth. Let’s do a high-level comparison of the three table formats to get you started. For the below grid, I’m borrowing some research from a Dremio webinar.
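The two strategies can be sketched in a few lines of toy Python (real formats operate on Parquet files and transaction logs, not in-memory dicts; the record shapes here are invented for illustration): CoW rewrites the base file with updates applied so reads stay simple, while MoR just appends updates to a delta log and makes readers do the merging.

```python
# Toy contrast of Copy on Write vs Merge on Read. Rows are dicts keyed
# by "id"; a real table format would work on columnar files instead.

def copy_on_write(base_file, updates):
    """CoW: produce a whole new file with updates already applied."""
    return [updates.get(row["id"], row) for row in base_file]

def merge_on_read_write(delta_log, updates):
    """MoR write path: just append the changed rows; cheap for the writer."""
    delta_log.extend(updates.values())

def merge_on_read_query(base_file, delta_log):
    """MoR read path: the reader merges base rows with the delta log."""
    latest = {row["id"]: row for row in base_file}
    for row in delta_log:
        latest[row["id"]] = row
    return list(latest.values())

base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
updates = {2: {"id": 2, "v": "b2"}}

cow_result = copy_on_write(base, updates)       # merge paid at write time
log = []
merge_on_read_write(log, updates)               # write is an append
mor_result = merge_on_read_query(base, log)     # merge paid at read time
```

Both paths end up with the same logical table; the difference is purely where the merge cost lands, which is why MoR favors write-heavy workloads and CoW favors read-heavy ones.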

Feature Overview

Feature                  Delta Lake   Hudi           Iceberg
ACID Transactions        Yes          Yes            Yes
Partition Evolution      No           No             Yes
Schema Evolution         Partial      Partial        Yes
Time Travel              Yes          Yes            Yes
File Formats Supported   Parquet      Orc, Parquet   Avro, Orc, Parquet

Schema Evolution

Feature                                           Delta Lake   Hudi          Iceberg
Add Column                                        Yes          Yes           Yes
Drop Column                                       No           Yes w/Spark   Yes
Rename Column                                     Yes          Yes w/Spark   Yes
Update Column                                     Yes          Yes w/Spark   Yes
Reorder Column                                    Yes          Yes w/Spark   Yes
Change partitioning w/out rewriting table         No           No            Yes
Use transforms of columns to specify partitions   Partial      No            Yes
Require understanding of table partitioning       Yes          No            Yes
File Pruning                                      Yes          Yes           Yes
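The "transforms of columns" and "file pruning" rows are easiest to see together. Here is a conceptual sketch (plain Python, not the Iceberg API; the timestamps and row shapes are invented) of partitioning a table by a transform of a column, such as the day of a timestamp, so a reader can skip every partition that can't match its filter.

```python
# Conceptual sketch of transform-based partitioning plus file pruning.
# Iceberg tracks transforms like day(ts) in table metadata; here we
# just simulate the idea with an in-memory dict of partitions.
from collections import defaultdict
from datetime import datetime

def day_transform(ts):
    """Derive the partition value from the raw column."""
    return ts.date().isoformat()

def write_partitioned(rows):
    """Group rows into partitions by the transform of their ts column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[day_transform(row["ts"])].append(row)
    return partitions

def prune_and_read(partitions, wanted_day):
    """File pruning: scan only the partition matching the predicate."""
    return partitions.get(wanted_day, [])

rows = [
    {"id": 1, "ts": datetime(2022, 6, 29, 9, 0)},
    {"id": 2, "ts": datetime(2022, 6, 29, 17, 30)},
    {"id": 3, "ts": datetime(2022, 6, 30, 8, 15)},
]
parts = write_partitioned(rows)
hits = prune_and_read(parts, "2022-06-29")  # one partition scanned, not all
```

Because the partition value is derived from the column, users filter on the column itself and never need to know the physical layout, which is the point of the "require understanding of table partitioning: No" entry.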

Read Support

Engine           Delta Lake   Hudi   Iceberg
Athena           Yes          Yes    Yes
Beam             Yes          No     No
BigQuery         No           Yes    No
Databricks SQL   Yes          No     No
Drill            No           No     Yes
Flink            Yes          Yes    Yes
Hive             Yes          Yes    Yes
Impala           No           Yes    Yes
Presto           Yes          Yes    Yes
Redshift         Yes          Yes    No
Snowflake        Yes          No     Yes
Sonar            Yes          No     Yes
Spark            Yes          Yes    Yes
Trino            Yes          Yes    Yes

Write Support

Engine              Delta Lake   Hudi   Iceberg
Athena              No           No     Yes
Flink               Yes          Yes    Yes
Impala              No           Yes    Yes
Databricks Photon   Yes          No     No
Presto              No           No     Yes
Sonar               No           No     Yes
Spark               Yes          Yes    Yes
Trino               Yes          No     Yes

Observations

Part of the reason this ecosystem evolved was performance, but part of it had to do with how cloud providers charge for their platforms. Storage is much cheaper than compute, so if you can query your raw storage without loading it into a conventional database, you’ll reduce your costs. By NOT putting the data in a database of some sort, though, we’ve had to develop a variety of file formats like Parquet, catalog systems that describe those files so they present as a schema, query engines, security plug-ins, table formats to deal with transactions, indexing, and more.

The appeal of systems like Redshift, Snowflake, Yugabyte, CockroachDB, and others is that all those things tend to be built in, just like the databases of old. Yes, there is a lot of flexibility in this scenario, as you can use all the bits and pieces that best suit your situation. But imagine if Amazon were to suddenly change its pricing policies with Graviton 5 (I’m just throwing out an idea here) because it got compute so cheap that it decided to make compute cheaper than storage, and really dropped egress fees on top. You could see a massive collapse of a segment of the tech sector. Kind of a scary thought.

Summary

The day before I started this article, DataBeans published a benchmark comparing Delta, Hudi, and Iceberg in which Delta comes out on top. The next day, Onehouse, a commercialization of Hudi by the Hudi developers as I understand it, published a rebuttal benchmark between Delta and Hudi. Both provide details and access to the files so you can try them yourself. I think Onehouse makes some good arguments, and it is odd that the DataBeans article is not attributed to any writers and came out at the same time as the Delta announcement.

The tests performed were ones that would show Delta in a better light with default configurations. It all seems just a little too convenient; it reads more like a paid placement than an organic comparison, but I could be wrong (this article is not paid for; I did it on my own time and dime). While Delta and Hudi have broad consumer adoption, the former because it is part of the Databricks product and the latter because it was first to market, I think Iceberg is likely to be the eventual winner in this space based on the commercial support I’m seeing, but you never know how the market can change from some unexpected innovation.

As to what is best for your environment, that’s going to depend on your specific needs; I’m simply talking about tech adoption. The nice thing about open source is that it isn’t company-dependent, so you can keep using the tech regardless of whether the commercial company that sold you support continues to exist.


Written by progrockrec | Technology and blockchain developer and enthusiast as well as a prolific musician.
Published by HackerNoon on 2022/07/04