Apache Iceberg remains at the forefront of innovation, redefining how we think about data lakehouse architectures. In 2025, the Iceberg ecosystem is poised for significant advancements that will empower organizations to handle data more efficiently, securely, and at scale. From enhanced interoperability with modern data tools to new features that simplify data management, the year ahead promises to be transformative. In this blog, we’ll explore 10 exciting developments in the Apache Iceberg ecosystem that you should keep an eye on, offering a glimpse into the future of open data lakehouse technology.
One of the most anticipated updates in the Iceberg ecosystem for 2025 is the addition of a "Scan Planning" endpoint to the Iceberg REST Catalog specification. This enhancement will allow query engines to delegate scan planning—the process of reading metadata to determine which files are needed for a query—to the catalog itself. This new capability opens the door to several exciting possibilities:
Optimized Scan Planning with Caching: By handling scan planning at the catalog level, frequently submitted queries can benefit from cached scan plans. This optimization reduces redundant metadata reads and accelerates query execution, irrespective of the engine used to submit the query.
Enhanced Interoperability Between Table Formats: With the catalog managing scan planning, the responsibility of supporting table formats shifts from the engine to the catalog. This makes it possible for Iceberg REST-compliant catalogs to facilitate querying tables in multiple formats. For example, a catalog could generate file lists for queries across various table formats, paving the way for broader interoperability.
Looking ahead, the introduction of this endpoint is not only a step toward improving query performance but also a glimpse into a future where catalogs become the central hub for table format compatibility. To fully realize this vision, a similar endpoint for handling metadata writes may be introduced in the future, further extending the catalog's capabilities.
Interoperable views are another major development to watch in the Apache Iceberg ecosystem for 2025. While Iceberg already supports a view specification, the current approach has limitations: it stores the SQL used to define the view, but since SQL syntax varies across engines, resolving these views is not always feasible in a multi-engine environment.
To address this challenge, two promising solutions are being explored:
SQL Transpilation with Frameworks like SQLGlot: By leveraging SQL transpilation tools such as SQLGlot, the SQL defining a view can be translated between different dialects. This approach builds on the existing view specification, which includes a "dialect" property to identify the SQL syntax used to define the view. This enables engines to resolve views by translating the SQL into a dialect they support.
Intermediate Representation for Views: Another approach involves using an intermediate format to represent views, independent of SQL syntax. Two notable projects being discussed in this context are:
Apache Calcite: An open-source project that provides a framework for parsing, validating, and optimizing relational algebra queries. Calcite could serve as a bridge, converting SQL into a standardized logical plan that any engine can execute.
Substrait: A cross-language specification for defining and exchanging query plans. Substrait focuses on representing queries in a portable, engine-agnostic format, making it a strong candidate for enabling true interoperability.
These advancements aim to make views in Iceberg truly interoperable, allowing seamless sharing and resolution of views across different engines and workflows. Whether through SQL transpilation or an intermediate format, these improvements will significantly enhance Iceberg's flexibility in heterogeneous data environments.
A materialized view stores a query definition as a logical table, with precomputed data that serves query results. By shifting the computational cost to precomputation, materialized views significantly improve query performance while maintaining flexibility. The Iceberg community is working towards a common metadata format for materialized views, enabling their creation, reading, and updating across different engines.
materialization.data.max-staleness
property). Otherwise, the query engine determines the next steps, such as refreshing the data or falling back to the original view definition.Materialized views in Iceberg offer a way to optimize query performance while ensuring that optimizations are portable across systems. By providing a standard for metadata and refresh mechanisms, Iceberg hopes to enable organizations to harness the benefits of materialized views without being locked into specific query engines. This development will make Iceberg an even more compelling choice for building scalable, engine-agnostic data lakehouses.
The upcoming introduction of the variant data format in Apache Iceberg marks a significant advancement in handling semi-structured data. While Iceberg already supports a JSON data format, the variant data type offers a more efficient and versatile approach to managing JSON-like data, aligning with the Spark variant format.
The variant data format is designed to provide a structured representation of semi-structured data, improving performance and usability:
Variant Data Format Pull Request
The integration of geospatial data types into Apache Iceberg is poised to open up powerful capabilities for organizations managing location-based data. While geospatial data has long been supported by big data tools like GeoParquet, Apache Sedona, and GeoMesa, Iceberg's position as a central table format makes the addition of native geospatial support a natural evolution. Leveraging prior efforts such as Geolake and Havasu, this proposal aims to bring geospatial functionality into Iceberg without the need for project forks.
The geospatial extension for Iceberg will introduce:
POINT
, LINESTRING
, and POLYGON
.ST_COVERS
, ST_COVERED_BY
, and ST_INTERSECTS
for spatial querying.XZ2
to optimize query filtering. CREATE TABLE geom_table (geom GEOMETRY);
INSERT INTO geom_table VALUES ('POINT(1 2)', 'LINESTRING(1 2, 3 4)');
SELECT * FROM geom_table WHERE ST_COVERS(geom, ST_POINT(0.5, 0.5));
ALTER TABLE geom_table ADD PARTITION FIELD (xz2(geom));
CALL rewrite_data_files(table => `geom_table`, sort_order => `hilbert(geom)`);
Apache Polaris is expanding its capabilities with the concept of federated catalogs, allowing seamless connectivity to external catalogs such as Nessie, Gravitino, and Unity. This feature makes the tables in these external catalogs visible and queryable from a Polaris connection, streamlining Iceberg data federation within a single interface.
At present, Polaris supports read-only external catalogs, enabling users to query and analyze data from connected catalogs without duplicating data or moving it between systems. This functionality simplifies data integration and allows users to leverage the strengths of multiple catalogs from a centralized Polaris environment.
There is active discussion and interest within the community to extend this capability to read/write catalog federation. With this enhancement, users will be able to:
The move toward read/write federation make it easier for organizations to manage diverse data ecosystems. By bridging the gap between disparate catalogs, Polaris continues to simplify data management and empower users to unlock the full potential of their data.
A feature beign discussed in the Apache Polaris community is the table maintenance service, designed to streamline table optimization and maintenance workflows. This service would function as a notification system, broadcasting maintenance requests to subscribed tools, enabling automated and efficient table management.
The table maintenance service allows users to configure maintenance triggers based on specific conditions. For example, users could set a table to be optimized every 10 snapshots. When this condition is met, the service broadcasts a notification to subscribed tools such as Dremio, Upsolver and any other service that optimizes Iceberg tables.
Catalog versioning, a transformative feature currently available in the Nessie catalog, is under discussion for inclusion in the Apache Polaris ecosystem. Adding catalog versioning to Polaris would unlock a range of powerful capabilities, positioning Polaris as a unifying force for the most innovative ideas in the Iceberg catalog space.
Catalog versioning provides a robust foundation for advanced data management scenarios by enabling:
Discussions around bringing catalog versioning to Polaris also involve designing a new model that aligns with Polaris' architecture. This integration could enable:
If implemented, catalog versioning in Polaris would elevate its capabilities, making it an indispensable tool for organizations looking to modernize their data lakehouse operations.
Try Catalog Versioning on your Laptop
Apache Iceberg’s innovative delete file specification has been central to enabling efficient upserts by managing record deletions with minimal performance overhead. Currently, Iceberg supports two types of delete files:
While these mechanisms are effective, each comes with trade-offs. Position deletes can lead to high I/O costs when reconciling deletions during queries, while equality deletes, though fast to write, impose significant costs during reads and optimizations. Discussions in the Iceberg community propose enhancements to both approaches.
The key proposal is to transition position deletes from their current file-based storage to deletion vectors within Puffin files. Puffin, a specification for structured metadata storage, allows for compact and efficient storage of additional data.
Benefits of Storing Deletion Vectors in Puffin Files:
Another area of discussion is rethinking equality deletes to better suit streaming scenarios. The current design prioritizes fast writes but incurs steep costs for reading and optimizing. Possible enhancements include:
The Dremio Hybrid Catalog, currently in private preview, is set to become generally available sometime in 2025. Built on the foundation of the Polaris catalog, this managed Iceberg catalog is tightly integrated into Dremio, offering a streamlined and feature-rich experience for managing data across cloud and on-prem environments.
The general availability of the Dremio Hybrid Catalog will mark a significant milestone for organizations adopting Iceberg. By integrating Polaris' advanced capabilities into a managed catalog, Dremio is poised to deliver a seamless and efficient solution for managing data lakehouse environments. This innovation underscores Dremio's commitment to making Iceberg a cornerstone of modern data management strategies.
As we look ahead to 2025, the Apache Iceberg ecosystem is set to deliver groundbreaking advancements that will transform how organizations manage and analyze their data. From enhanced query optimization with scan planning endpoints and materialized views to broader support for geospatial and semi-structured data, Iceberg continues to push the boundaries of data lakehouse capabilities. Exciting developments like the Dremio Hybrid Catalog and updates to delete file specifications promise to make Iceberg even more efficient, scalable, and interoperable.
These innovations highlight the vibrant community driving Apache Iceberg and the collective effort to address the evolving needs of modern data platforms. Whether you're leveraging Iceberg for its robust cataloging features, seamless multi-cloud support, or cutting-edge query capabilities, 2025 is shaping up to be a year of remarkable growth and opportunity. Stay tuned as Apache Iceberg continues to lead the way in open data lakehouse technology, empowering organizations to unlock the full potential of their data.