As I write this on June 29, 2022, the "Data + AI Summit" is on its last day in San Francisco. I'd been thinking about writing about this topic for nearly a year, but the announcement from Databricks that it will open source all Delta Lake APIs as part of the Delta Lake 2.0 release finally pushed me to do it. Databricks also announced that it will be contributing all enhancements of Delta Lake to The Linux Foundation. That brings us to the "Table Format Wars" subject of this article, where we will look at Delta Lake (created at Databricks), Apache Hudi (created at Uber), and Apache Iceberg (created at Netflix). For the sake of this article, I'm going to assume you know what the data lakehouse is and how we got here; if you are unfamiliar, there are great articles out there to catch you up.

## Background

In the long long ago, in the before time, we had databases, where compute, storage, security, indexing, and all that good stuff was in one place. As computing needs advanced at a crazy rate, software was developed to address those needs. Soon we were dumping structured, semi-structured, and unstructured data (audio, video, images) into storage like AWS S3, and we needed a way to understand the structure of what was there. This gave rise to catalogs such as the Hive Metastore or AWS Glue that would describe the data. That was a start, but we also needed to work with the data in those files like in a database, inserting, deleting, and updating records, and that gave rise to the Hive table format. It was quite primitive compared to the options available today and, if I recall correctly, didn't work with the metastore yet, so users had to know the layouts of the files.

## Today

Databricks built and released Delta Lake as a table format in April 2019, but the open-source version was pretty limited and Databricks was really the only company doing any serious work on it. That's what made their announcement about open-sourcing a bunch of stuff so interesting.
I read it as a bit of a panic move, as Iceberg is getting so much vendor adoption and so many vendor contributors. Iceberg is also the newer kid on the block, graduating to a top-level Apache project in May 2020. Hudi started life at Uber in 2016 and was open-sourced in 2017, and it is used at some dang big companies other than Uber, like the Robinhood trading platform, Amazon, ByteDance (creators of TikTok), and many others.

## Features

The way these table formats handle upserts, deletes, and so on is generally one of two methods:

- Copy on Write (CoW)
- Merge on Read (MoR)

Keep in mind that these changes to the files come in as change logs, which means the latest versions need to be reconciled for queries, or for time travel. Time travel is a cool feature of this configuration, but I'm not going to address it in this article. MoR tends to be faster than CoW for writes, but this is a pretty detailed topic that you should research in depth. Let's do a high-level comparison of the three table formats to get you started. For the grid below, I'm borrowing some research from a Dremio webinar.
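To make the CoW/MoR distinction concrete, here is a minimal pure-Python sketch of the two strategies. This is illustrative only; none of these function names come from the actual Delta, Hudi, or Iceberg APIs, and real implementations operate on Parquet files and log/manifest structures rather than Python lists.

```python
# Conceptual sketch of Copy on Write vs Merge on Read for a batch of updates.
# Illustrative pure Python; NOT the API of Delta Lake, Hudi, or Iceberg.

def copy_on_write(data_file, updates):
    """CoW: rewrite the whole data file with updates applied.
    Writes are expensive (full rewrite), reads are cheap (one clean file)."""
    return [updates.get(row["id"], row) for row in data_file]

def merge_on_read(delta_log, updates):
    """MoR: append updates to a delta log; the base file is left untouched.
    Writes are cheap (append-only), reads must merge base + log."""
    delta_log.extend(updates.values())
    return delta_log

def read_merged(data_file, delta_log):
    """A MoR read reconciles the base file with the latest log entries."""
    latest = {row["id"]: row for row in data_file}
    for row in delta_log:
        latest[row["id"]] = row
    return list(latest.values())
```

For example, with a base file `[{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]` and an update to row 2, `copy_on_write` produces one fully rewritten file, while `merge_on_read` just appends to the log and `read_merged` reconciles at query time; both paths yield the same logical table.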
## Feature Overview

| Feature | Delta Lake | Hudi | Iceberg |
| --- | --- | --- | --- |
| ACID Transactions | Yes | Yes | Yes |
| Partition Evolution | No | No | Yes |
| Schema Evolution | Partial | Partial | Yes |
| Time Travel | Yes | Yes | Yes |
| File Formats Supported | Parquet | ORC, Parquet | Avro, ORC, Parquet |

## Schema Evolution

| Operation | Delta Lake | Hudi | Iceberg |
| --- | --- | --- | --- |
| Add Column | Yes | Yes | Yes |
| Drop Column | No | Yes w/Spark | Yes |
| Rename Column | Yes | Yes w/Spark | Yes |
| Update Column | Yes | Yes w/Spark | Yes |
| Reorder Column | Yes | Yes w/Spark | Yes |

## Partitioning

| Capability | Delta Lake | Hudi | Iceberg |
| --- | --- | --- | --- |
| Change partitioning w/out rewriting table | No | No | Yes |
| Use transforms of columns to specify partitions | Partial | No | Yes |
| Require understanding of table partitioning | Yes | No | Yes |
| File Pruning | Yes | Yes | Yes |

## Read Support

| Engine | Delta Lake | Hudi | Iceberg |
| --- | --- | --- | --- |
| Athena | Yes | Yes | Yes |
| Beam | Yes | No | No |
| BigQuery | No | Yes | No |
| Databricks SQL | Yes | No | No |
| Drill | No | No | Yes |
| Flink | Yes | Yes | Yes |
| Hive | Yes | Yes | Yes |
| Impala | No | Yes | Yes |
| Presto | Yes | Yes | Yes |
| Redshift | Yes | Yes | No |
| Snowflake | Yes | No | Yes |
| Sonar | Yes | No | Yes |
| Spark | Yes | Yes | Yes |
| Trino | Yes | Yes | Yes |

## Write Support

| Engine | Delta Lake | Hudi | Iceberg |
| --- | --- | --- | --- |
| Athena | No | No | Yes |
| Flink | Yes | Yes | Yes |
| Impala | No | Yes | Yes |
| Databricks Photon | Yes | No | No |
| Presto | No | No | Yes |
| Sonar | No | No | Yes |
| Spark | Yes | Yes | Yes |
| Trino | Yes | No | Yes |

## Observations

Part of the reason this ecosystem evolved was performance, but part of it had to do with how cloud providers charge for their platforms. Storage is much cheaper than compute, so if you can query your raw storage without loading it into a conventional database, you'll reduce your costs. By NOT putting the data in a database of some sort, we've had to develop a variety of file formats like Parquet, catalog systems that describe those files so they present as a schema, query engines, security plug-ins, table formats to deal with transactions, indexing, and more. The appeal of systems like Redshift, Snowflake, Yugabyte, CockroachDB, and others is that all those things tend to be built in, just like the databases of old.
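The storage-versus-compute pricing point can be made concrete with a toy calculation. The rates below are purely hypothetical placeholders (not actual cloud prices), but the shape of the math is why querying raw object storage on demand is attractive:

```python
# Toy illustration of the storage-vs-compute cost tradeoff.
# All prices are hypothetical placeholders, not real cloud rates.

STORAGE_PER_GB_MONTH = 0.02   # hypothetical object-storage rate, $/GB-month
COMPUTE_PER_HOUR = 5.00       # hypothetical cluster/warehouse rate, $/hour

def on_demand_cost(gb, query_hours_per_month):
    """Lakehouse-style: data sits in cheap storage; compute runs only
    for the hours you actually spend querying."""
    return gb * STORAGE_PER_GB_MONTH + query_hours_per_month * COMPUTE_PER_HOUR

def always_on_cost(gb, hours_per_month=730):
    """Always-on cluster: same storage bill, but compute is billed
    around the clock whether or not queries are running."""
    return gb * STORAGE_PER_GB_MONTH + hours_per_month * COMPUTE_PER_HOUR

# 1 TB of data queried 10 hours a month vs. a 24/7 cluster:
# the compute term dominates, so idle compute is where the money goes.
monthly_lake = on_demand_cost(1000, 10)
monthly_warehouse = always_on_cost(1000)
```

Under these made-up numbers the on-demand pattern costs a small fraction of the always-on one; the point is not the specific figures but that the compute term dwarfs the storage term, which is the economic pressure the sentence above describes (and the pressure that evaporates in the Graviton thought experiment below).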
Yes, there is a lot of flexibility in this scenario, as you can use all the bits and pieces that best suit your situation. But imagine if Amazon were to suddenly change its pricing policies with, say, a Graviton 5 (I'm just throwing out an idea here) because compute got so cheap that they decided to price it below storage and really drop egress fees. You could see a massive collapse of a segment of the tech sector. Kind of a scary thought.

## Summary

The day before I started this article, DataBeans published a benchmark comparing Delta, Hudi, and Iceberg in which Delta comes out on top. The next day, Onehouse, a commercialization of Hudi by the Hudi developers as I understand it, published a rebuttal benchmark between Delta and Hudi. Both provide details and access to the files so you can try them yourself. I think Onehouse makes some good arguments, and it is odd that the DataBeans article is not attributed to any writers and came out at the same time as the Delta announcement. The tests performed were ones that would show Delta in a better light with default configurations. It all seems a little too convenient, meaning it reads more like a paid placement than an organic comparison, but I could be wrong (this article is not paid for; I did it on my own time and dime). While Delta and Hudi have broad consumer adoption, the former because it is part of the Databricks product and the latter because it was first to market, I think Iceberg is likely to be the eventual winner in this space based on the commercial support I'm seeing, though you never know how the market can change from some unexpected innovation. As to what is best for your environment, that's going to be up to your specific needs; I'm simply talking about tech adoption. The nice thing about open source is that it isn't company-dependent, so you can keep using the tech regardless of whether the commercial company that sold you support continues to exist.