
What The Heck is Apache Polaris?

by Shawn Gordon, September 11th, 2024


Introduction

The data space is almost as volatile as the AI space this year, with many players consolidating. In the summer of 2022, I wrote about The Data Table Format Wars, and the ecosystem has only gotten more interesting since then. I won’t get deep into the weeds on data lakes vs. data warehouses, etc., but suffice it to say, we are seeing everything coalesce around decoupling data storage from compute.

Table formats are a way to standardize lightly opinionated storage. As I opined a couple of years ago, Apache Iceberg seems to be where they are settling. These formats tend to use Apache Parquet on object storage and provide a method for tracking new data and data state changes.

In 2023, Databricks announced the open-sourcing of their Unity Catalog, and Onehouse announced OneTable (now X-Table). These projects are meant to give you a single interface to the three popular table formats: Delta, Hudi, and Iceberg. Then, June 2024 happened…

What Happened?

Snowflake was holding its annual Snowflake Summit conference, and on June 3, 2024, they announced Apache Polaris (incubating), a vendor-neutral, open catalog for Apache Iceberg. The next day, and a week before the Databricks conference, Databricks announced their acquisition of Tabular, a company founded by the creators of Iceberg. This still seems weird to everyone on the outside because Iceberg is a spec, not a product, and Tabular isn’t even the main contributor to Iceberg. In my opinion, Databricks will somehow roll their Delta Lake into Iceberg.

Behind the scenes, Snowflake had been trying to acquire Tabular. Something happened at the eleventh hour: Databricks came in and laid down a ton more money, and the Snowflake bid was discarded. Snowflake was already two years into their commitment to supporting Iceberg as external storage, which made a lot of sense for them regardless.

Snowflake has a reputation for slow releases after an announcement, so there was some skepticism; some even wondered if any code had been written. It turns out Snowflake had been working on this for a while, and Polaris was made available in very short order, with vendors like Dremio, Starburst, CelerData, and Upsolver showing solutions built on it in less than two months: a delightful surprise and a positive sign of things to come.

All of that is a long way to set up Apache Polaris, but hopefully it gives you a sense of how important a good catalog is in all of this. How will any of your compute tools work if you don’t have a unified system to describe all those Parquet files in your data lake?

Polaris Overview

Apache Polaris is an open-source catalog implementation for Apache Iceberg tables. Its main features include:

  1. REST API implementation: Polaris implements Iceberg's REST API, enabling interoperability across multiple query engines (see the PySpark connection sketch after this list).
  2. Multi-engine compatibility: It allows various query engines, such as Apache Spark, Apache Flink, Trino, and others, to read and write Iceberg tables through a single catalog.
  3. Centralized management: Polaris provides centralized and secure access to Iceberg tables across REST-compatible query engines.
  4. Namespace and metadata management: It supports creating namespaces for logical grouping and manages metadata for Iceberg tables.
  5. Storage configurations: Polaris offers robust storage configurations for cloud object stores such as Amazon S3, Azure storage, and Google Cloud Storage.
  6. Security features: It implements role-based access control (RBAC) and uses credential vending to secure query execution.
  7. Open-source nature: Polaris is fully open-source, allowing for community contributions and vendor neutrality.
  8. Interoperability: It enables multiple engines to be used on a single copy of data, reducing storage duplication and ETL costs.
  9. Consistent access control: Polaris allows you to manage role-based access control and storage layer security consistently across engines from one place.
  10. Flexibility in hosting: It can be hosted on various infrastructures, providing flexibility and reducing vendor lock-in.
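
Because Polaris exposes the standard Iceberg REST catalog API, pointing an engine at it is mostly a matter of catalog configuration. Below is a minimal PySpark sketch of what that looks like; the endpoint URL, the polaris_demo catalog name, and the client credentials are placeholders you would swap for your own values, and the option names follow the Iceberg REST catalog conventions used in the Polaris quickstart.

```python
from pyspark.sql import SparkSession

# A minimal sketch of registering a Polaris catalog with Spark through Iceberg's
# REST catalog support. The URI, warehouse (catalog) name, and client credentials
# are placeholders for whatever your Polaris deployment uses.
spark = (
    SparkSession.builder.appName("polaris-demo")
    # Iceberg runtime for Spark 3.5 / Scala 2.12; match this to your Spark build.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a Spark catalog named "polaris" backed by the Polaris REST endpoint.
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "http://localhost:8181/api/catalog")
    # OAuth client credentials created for a Polaris principal (placeholder values).
    .config("spark.sql.catalog.polaris.credential", "<client-id>:<client-secret>")
    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
    # The Polaris catalog to expose through this Spark catalog.
    .config("spark.sql.catalog.polaris.warehouse", "polaris_demo")
    # Ask Polaris to vend short-lived storage credentials for reads and writes.
    .config("spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation",
            "vended-credentials")
    .getOrCreate()
)

spark.sql("SHOW NAMESPACES IN polaris").show()
```

The same handful of REST settings carry over, with minor syntax differences, to Flink, Trino, or any other engine with an Iceberg REST catalog client.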

And a lovely little graphic from the docs to put a picture on everything.

Diagram from Polaris docs

Borrowing from the Polaris docs for a picture once again, let’s dive into what is happening here in this entity hierarchy graphic:

Diagram from Polaris docs

We can have multiple catalogs in a single Polaris install; in this diagram, only Catalog1 has any namespaces or tables defined. A single catalog can be either internal or external. Tables in an internal catalog are read/write, meaning you have full control. Tables in an external catalog are currently read-only and are intended to mirror data managed by some external catalog, such as an AWS Glue catalog, which Polaris treats as the source of truth once connected.

The catalog organizes your Iceberg tables; you’ll want to configure it with a storage configuration for whichever cloud you are on. The catalog forms the first architectural layer in the Iceberg table spec.

A namespace can be nested to an arbitrary depth and is used to logically group Iceberg tables within a catalog; an Iceberg table always belongs to a namespace. With this configuration in place, you can use the Polaris catalog to describe your Iceberg tables in such a way that they are easily accessible and manageable from a large variety of tools.
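
As a quick illustration, here is a continuation of the earlier PySpark sketch that creates a nested namespace and an Iceberg table inside it; the sales.us_west namespace and orders table are made-up names for the example.

```python
# Continuing with the "polaris" Spark catalog registered in the earlier sketch.
# Namespaces can nest arbitrarily, and a table always lives inside a namespace.
spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.sales")
spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.sales.us_west")

# Create and populate an Iceberg table inside the nested namespace.
spark.sql("""
    CREATE TABLE IF NOT EXISTS polaris.sales.us_west.orders (
        order_id   BIGINT,
        amount     DECIMAL(10, 2),
        ordered_at TIMESTAMP
    ) USING iceberg
""")

spark.sql("""
    INSERT INTO polaris.sales.us_west.orders
    VALUES (1, 19.99, current_timestamp())
""")

spark.sql("SELECT * FROM polaris.sales.us_west.orders").show()
```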

That’s a quick overview of what is happening and what it looks like. Snowflake’s hosted implementation provides a really handy UI for managing all of this; other vendors have their own, and you can just do it through Spark if you like.

Summary

I continue to believe that Apache Iceberg is going to “win” the table format wars. Apache Polaris focusing on Iceberg is not a bad thing. I want to dig into how RBAC has been implemented in Polaris another time and see if it is as robust as what the Tabular product implemented, which I was a big fan of. I believe Polaris is a big win for the Iceberg community and will help accelerate adoption even further.

I’ve seen talk about the Dremio Nessie project getting rolled in, and I’m curious to see what that will look like.

I haven’t looked at Nessie in a couple of years, and I thought its git-like semantics had mainly been covered by Iceberg tagging and branching, which I wrote about last year. Yet another story for another day.

I’m a big fan of what is happening here. If you are in this space, this is the right train to jump on.

Check out my other What the Heck is… articles at the links below: