This article is meant to be a sober reflection on the data lakehouse table format conversation I have taken part in over the last two years. I've written the following articles on the subject:
I am writing this article in response to these two pieces:
DISCLOSURE: I work for Dremio, the Data Lakehouse Platform, which is a contributor to Apache Iceberg, has robust support for Apache Iceberg Data Lakehouses (DDL, DML, Cataloging, Table Optimization), and can also read Delta Lake tables.
Choosing the right open-source table format for your data lakehouse is more top of mind than ever as people explore implementing data lakehouses and data lakehouse platforms. With three prominent options in Apache Iceberg, Apache Hudi, and Delta Lake, all vying to be the heart of your lakehouse, it's easy to get lost in the noise of feature sets and performance benchmarks. This blog aims to cut through the fear, uncertainty, and doubt (FUD) surrounding table formats, providing a clear path to understanding what really matters in your selection process.
All major table formats are mature projects that are constantly improving performance and feature sets. The existence of multiple formats is a boon to the industry, as it fosters a competitive environment that drives improvements across the board. However, when it comes down to choosing a format, it's not just about features and performance. The critical decision should be based on the ecosystem of tools you plan to use and how well they integrate with the format. By focusing on your use cases and testing with your actual workloads, you'll find the table format that best meets your needs.
Over the last two years, I've immersed myself in the world of table formats and lakehouses, a journey that I plan to continue for the foreseeable future. My entry into this conversation began with an article comparing different table formats. That piece, which became one of my most-viewed writings, initially focused on the various features of these formats. However, as I've updated the article over time, I've noticed the narrowing differences between them, leading to a significant realization.
What truly matters is not the array of features a format possesses but the ecosystem that supports these features and their usability. After all, what's the point of advanced functionality if it's not utilized effectively or is too complex to integrate? The real value lies in how these features are implemented within the ecosystem and how seamlessly they can be adopted into your workflows.
Ecosystem and Usability Over Features

The first major takeaway from my experience is the paramount importance of the ecosystem and usability over the raw feature set. A table format might boast an impressive list of capabilities, but if those features are not accessible or practical for your team's use, their value diminishes. Thus, when evaluating table formats, consider the supporting tools, documentation, community, and ease of integration into your existing systems.
When delving into the world of table formats, it's easy to get caught up in the numbers game of performance benchmarks. While these metrics offer a snapshot of efficiency under ideal conditions, they seldom provide the full picture needed for informed decision-making.
Benchmarks, particularly those in Apache Spark environments, have their place in performance evaluation. However, they often reflect a momentary advantage, influenced by specific configurations and data sets that may not align with everyday usage.
The reality is that the performance of a table format is not static; it evolves with each update to its libraries and to the processing tool, and as users configure the table and tools to suit their unique workloads. True performance evaluation requires a broader lens, one that considers how a table format behaves with the suite of tools you rely on and the nature of your data.
Choosing a table format based on performance means looking beyond Spark-centric benchmarks. It involves testing how each format manages your specific data types and use cases across the tools and technologies that form the backbone of your data infrastructure. Only by assessing performance in this holistic manner can you ensure that your chosen table format aligns with your operational realities and performance expectations.
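To make the idea of testing with your own workloads a little more concrete, here is a minimal sketch of a timing harness in Python. It is an illustration under stated assumptions, not a benchmark methodology: the names `time_workloads` and `fastest_per_engine` are hypothetical, and each workload is assumed to be a zero-argument callable that you write yourself to run one representative query against one table format through one engine.

```python
import statistics
import time
from typing import Callable, Dict, Tuple

# Hypothetical shape: (table format, engine) -> callable that runs one
# representative query from your own workload. You supply these callables.
Workloads = Dict[Tuple[str, str], Callable[[], object]]


def time_workloads(workloads: Workloads, warmups: int = 1, runs: int = 5) -> Dict[Tuple[str, str], float]:
    """Return the median wall-clock seconds for each (format, engine) pair."""
    results = {}
    for key, run_query in workloads.items():
        for _ in range(warmups):
            run_query()  # warm caches so the first run does not skew the timing
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            run_query()
            samples.append(time.perf_counter() - start)
        results[key] = statistics.median(samples)
    return results


def fastest_per_engine(results: Dict[Tuple[str, str], float]) -> Dict[str, Tuple[str, float]]:
    """For each engine, report which format had the lowest median time."""
    best: Dict[str, Tuple[str, float]] = {}
    for (table_format, engine), seconds in results.items():
        if engine not in best or seconds < best[engine][1]:
            best[engine] = (table_format, seconds)
    return best
```

The point of keying results by both format and engine is that the same format can rank differently depending on the tool reading it, so a single Spark-only number would hide exactly the variation you care about. Rerun the harness after library or engine upgrades, since the results are only a snapshot.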
A critical aspect I'd like to clarify from my original blog post is the rationale behind using the number of open-source contributors as a key metric. This choice might have raised some eyebrows, especially considering the complex structure of repository management across many table formats. These projects often distribute their development efforts across multiple repositories, separating the core project—typically written in Java—from implementations in other programming languages like Python, Go, and Rust.
The decision to concentrate on contributors to the core format was deliberate. The core repository is where the most significant decisions regarding the format's evolution are made. It's the heart of the project, determining not just current functionalities but also setting the direction for future developments. The distribution of influence over these decisions is crucial. It reveals whether the evolution of the format is being guided by a diverse group of stakeholders or if it's under the control of a few, potentially leading to a bias in development priorities.
While there are undoubtedly many talented contributors working across the various language-specific repositories, their work often focuses on mirroring the functionality established in the core. Although valuable, this does not typically drive the project's overall direction. The core repository's contributor diversity is, therefore, a more accurate indicator of the project's health and its alignment with the broader community's needs.
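For readers who want to look at contributor diversity in a core repository themselves, the following is a rough Python sketch. My original comparison simply counted contributors; the concentration measures here (the largest organization's share of commits and a Herfindahl-Hirschman-style index) are an illustrative addition, not the method used in that post. The function name `contributor_summary` and the input shape, one `(contributor, organization)` pair per commit, are assumptions, and mapping contributors to organizations is something you would have to do on your own.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple, Union


def contributor_summary(commits: Iterable[Tuple[str, str]]) -> Dict[str, Union[int, float, str]]:
    """Summarize how contributions to a repository are distributed.

    `commits` holds one (contributor, organization) pair per commit.
    This input shape is an assumption made for the sake of the sketch.
    """
    commits = list(commits)
    if not commits:
        raise ValueError("no commits to summarize")

    contributors = {person for person, _ in commits}
    commits_by_org = Counter(org for _, org in commits)
    total = sum(commits_by_org.values())
    shares = {org: count / total for org, count in commits_by_org.items()}

    # Share of commits coming from the single most active organization.
    top_org, top_share = max(shares.items(), key=lambda item: item[1])
    # Sum of squared shares: 1.0 means one organization made every commit,
    # values near 0 mean commits are spread across many organizations.
    concentration = sum(share * share for share in shares.values())

    return {
        "contributors": len(contributors),
        "organizations": len(commits_by_org),
        "top_org": top_org,
        "top_share": top_share,
        "concentration": concentration,
    }
```

Commit counts are a crude proxy for influence, so treat numbers like these as a prompt for further questions about who reviews and approves changes rather than as a verdict.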
From a consumer's perspective, the diversity of contributors to the core repository is more than a technicality; it's a safeguard against vendor lock-in, budget inflation, and the potential stagnation of innovation in a controlled market. A table format steered by a community with varied interests ensures that no single entity can dominate the project's direction to the detriment of the community. This competitive diversity fosters an environment where vendors can innovate on equal footing, driving technological advancements and more favorable pricing for end-users.
In aiming for a competitive market, the significance of an open format governed by a broad coalition cannot be overstated. The diversity of core repo contributors matters because it ensures that the format remains open, adaptable, and free from monopolistic control. By prioritizing formats with a wide array of contributors, we champion the principles of open innovation and collective progress that are foundational to the open source ethos.
It's clear that the table format conversation transcends mere technical specifications and performance metrics. The past two years have illuminated the paramount importance of ecosystem compatibility, the tangible benefits of a broad, diverse community of open-source contributors, and the nuanced understanding required to evaluate performance in real-world scenarios. As we navigate the choices between Apache Iceberg, Apache Hudi, and Delta Lake, it becomes evident that the true value lies in fostering an open, competitive market that drives innovation, ensures flexibility, and avoids vendor lock-in. By prioritizing formats that align with our technological ecosystems, supported by a vibrant community, and delivering performance on our workloads with our desired tooling, we can build data infrastructures that are not only robust and efficient but also adaptable and forward-looking.
Also published here.