SQL (Structured Query Language) has long been the standard for interacting with data, providing a powerful and accessible language for data querying and manipulation. However, traditional data warehouses and databases often fall short when dealing with the scale and flexibility demanded by modern data workloads.
This is where Apache Iceberg and Dremio come in. Apache Iceberg is an open table format designed for large-scale data lakes, enabling reliable data management with features like ACID transactions, schema evolution, and time travel. Iceberg brings structure and governance to data lakes, making them more capable of handling enterprise data needs. Dremio, on the other hand, is a data lakehouse platform that brings SQL querying capabilities to data lakes, providing a unified interface to query and analyze data across various sources.
By the end of this tutorial, you'll understand the basics of SQL in Dremio and how to perform essential data operations with Apache Iceberg tables.
SQL, or Structured Query Language, is a language specifically designed for managing and querying data in relational databases. Its versatility and power make it ideal for a wide range of data operations, including data extraction, aggregation, and transformation. SQL's widespread use in data analysis and reporting has made it a cornerstone in the world of data management.
Apache Iceberg is an open-source table format that brings structure and governance to data lakes. Designed with scalability in mind, Iceberg offers features such as ACID transactions, schema evolution, hidden partitioning, and time travel.
Dremio is a data lakehouse platform that unifies data access, enabling users to perform SQL queries across data lakes, warehouses, and other data sources through a single, user-friendly interface. Dremio simplifies data analytics by providing a unified SQL interface, a self-service semantic layer, and query acceleration.
When combined, SQL, Apache Iceberg, and Dremio offer a powerful solution for data management and analysis. SQL provides the querying foundation, Apache Iceberg delivers the scalability and governance, and Dremio brings everything together in a streamlined, accessible environment. For businesses looking to harness the full potential of their data lakes, this stack delivers efficient querying, advanced data governance, and high performance.
Let's set up an environment to work with these tools and walk through practical examples of using SQL with Apache Iceberg tables in Dremio.
To start working with Apache Iceberg and Dremio, we'll set up a local environment using Docker Compose, a tool that allows us to configure and manage multiple containers with a single file. In this setup, we'll use Nessie as the catalog, MinIO as S3-compatible object storage, and Dremio as the query engine.
This environment will give us a powerful foundation to perform SQL operations on Apache Iceberg tables with Dremio.
Open a text editor of your choice (such as VS Code, Notepad, or Sublime Text).
Create a new file named docker-compose.yml in a new, empty folder. This file will define the services and configurations needed for our environment. Copy and paste the following configuration into docker-compose.yml:
version: "3"

services:
  # Nessie Catalog Server Using In-Memory Store
  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    networks:
      - iceberg
    ports:
      - 19120:19120

  # MinIO Storage Server
  ## Creates two buckets named lakehouse and lake
  minio:
    image: minio/minio:latest
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    networks:
      - iceberg
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
    entrypoint: >
      /bin/sh -c "
      minio server /data --console-address ':9001' &
      sleep 5 &&
      mc alias set myminio http://localhost:9000 admin password &&
      mc mb myminio/lakehouse &&
      mc mb myminio/lake &&
      tail -f /dev/null
      "

  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    ports:
      - 9047:9047
      - 31010:31010
      - 32010:32010
    container_name: dremio
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      - iceberg

networks:
  iceberg:
Explanation of the services:
- Nessie: a catalog server that tracks Iceberg table versions, using an in-memory store for this demo.
- MinIO: an S3-compatible storage server whose startup script creates two buckets, lakehouse and lake, to separate structured Iceberg data from raw data.
- Dremio: the query engine that will connect to both Nessie and MinIO.
With the docker-compose.yml file ready, follow these steps to launch the environment:
Open a terminal (Command Prompt, PowerShell, or terminal app) and navigate to the folder where you saved docker-compose.yml.
Run the following command to start all services in detached mode:
docker-compose up -d
Wait a few moments for the services to initialize. You can check if the services are running by using:
docker ps
This command should list nessie, minio, and dremio as running containers.
After starting the containers, verify that each service is accessible:
- Dremio: open http://localhost:9047. You should see the Dremio login screen.
- MinIO: open http://localhost:9001. Log in with the username admin and password password to access the MinIO console.
To stop the environment when you're done, run the following command in the same folder as your docker-compose.yml file:
docker-compose down -v
This command stops and removes all containers and associated volumes, allowing you to start fresh next time.
With our environment up and running, we’re ready to start using Dremio to create and manage Apache Iceberg tables. In the next section, we’ll explore how to connect Nessie to Dremio and begin querying our data.
Now that our environment is up and running, let’s connect to Dremio, which will act as our query engine, and configure Nessie as a source catalog. This setup will allow us to take advantage of Apache Iceberg’s versioned data management and perform SQL operations in a streamlined, unified environment.
Open http://localhost:9047 in your browser. You should see the Dremio login screen.
Nessie acts as the catalog for our Iceberg tables, enabling us to manage data with version control features such as branching and merging. Let's add Nessie as a source in Dremio.
- Name: Enter a name for the source, such as lakehouse.
- Endpoint URL: Enter the endpoint for the Nessie API: http://nessie:19120/api/v2
- Authentication: Choose None (since Nessie is running locally and does not require additional credentials in this setup).
Under the source's storage settings, configure MinIO as the underlying storage:
- AWS Access Key: admin (the MinIO username).
- AWS Secret Key: password (the MinIO password).
- AWS Root Path: lakehouse (this is the bucket where our Iceberg tables will be stored).
- Connection property fs.s3a.path.style.access: true.
- Connection property fs.s3a.endpoint: minio:9000.
- Connection property dremio.s3.compat: true.
After saving, you should see lakehouse (or the name you assigned) listed in the Datasets section of Dremio's interface.
In addition to Nessie, we can add MinIO as a general S3-compatible source in Dremio. This source allows us to access raw data files stored in the MinIO lake bucket, enabling direct SQL queries on various file types (e.g., JSON, CSV, Parquet) without the need to define tables.
To add it, create a new Amazon S3 source and configure:
- Name: lake.
- AWS Access Key: admin.
- AWS Secret Key: password.
- Enable compatibility mode: true to ensure compatibility with MinIO.
- Root Path: /lake (the bucket name for general storage).
- Connection property fs.s3a.path.style.access: true.
- Connection property fs.s3a.endpoint: minio:9000.
After saving, the lake source will appear in the Datasets section.
With both sources connected, you should see lakehouse and lake listed under Datasets in Dremio. These sources provide access to structured, versioned data in the lakehouse bucket and general-purpose data in the lake bucket.
Let's explore how to use SQL within Dremio to create tables, insert data, and perform various data operations on our Iceberg tables.
Now that our environment is configured and connected, let's dive into creating tables using SQL in Dremio. Apache Iceberg tables in Dremio allow us to take advantage of Iceberg’s powerful features, such as schema evolution and advanced partitioning.
Creating Tables with CREATE TABLE
The CREATE TABLE command in Dremio allows us to define a new Iceberg table with specific columns, data types, and optional partitioning. Below, we'll cover the syntax and provide examples for creating tables.
CREATE TABLE Syntax
CREATE TABLE [IF NOT EXISTS] <table_name> (
  <column_name1> <data_type>,
  <column_name2> <data_type>,
  ...
)
[ PARTITION BY (<partition_transform>) ];
- IF NOT EXISTS: Optionally add this clause to create the table only if it does not already exist.
- <table_name>: Use a fully qualified name (e.g., lakehouse.customers) to specify the location in the Nessie catalog.
- Column definitions: Use standard SQL data types (e.g., VARCHAR, INT, TIMESTAMP).
- PARTITION BY: Specify a partitioning strategy, which is especially useful for Iceberg tables. Iceberg supports several partition transforms, such as year, month, day, bucket, and truncate.
Let's create a simple table to store customer data.
CREATE TABLE lakehouse.customers (
  id INT,
  first_name VARCHAR,
  last_name VARCHAR,
  age INT
);
In this example, we define a customers table within the lakehouse source, where each row represents a customer with an id, first_name, last_name, and age.
To optimize queries, we can partition the customers table by the first letter of the last_name column using the truncate transform.
CREATE TABLE lakehouse.customers_partitioned (
  id INT,
  first_name VARCHAR,
  last_name VARCHAR,
  age INT
) PARTITION BY (truncate(1, last_name));
Here, we use the PARTITION BY clause with truncate(1, last_name), which will partition the data by the first character of the last_name column. Partitioning helps to improve query performance by allowing Dremio to read only the relevant data based on query filters.
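To make the transform concrete, here is a small illustrative Python sketch (not Dremio or Iceberg code; the sample rows are made up) that mimics the string form of Iceberg's truncate transform, which keeps the first N characters of a value as the partition key:

```python
def truncate_transform(width: int, value: str) -> str:
    # Iceberg's truncate transform for strings keeps the first
    # `width` characters of the value as the partition key.
    return value[:width]

# Sample rows mirroring the customers_partitioned table.
customers = [
    (1, "John", "Doe", 28),
    (2, "Jane", "Smith", 34),
    (6, "Emma", "Anderson", 29),
]

# Group rows by partition key, roughly how Iceberg lays out data files.
partitions = {}
for row in customers:
    key = truncate_transform(1, row[2])  # first letter of last_name
    partitions.setdefault(key, []).append(row)

print(sorted(partitions))  # ['A', 'D', 'S']
```

A query filtered on last_name can then skip every partition whose key cannot match the filter.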
If we have a table to store order data, we may want to partition it by the date the order was placed.
CREATE TABLE lakehouse.orders (
  order_id INT,
  customer_id INT,
  order_date DATE,
  total_amount DOUBLE
) PARTITION BY (month(order_date));
In this case, month(order_date) partitions the table by the month of the order_date field, making it easier to run queries filtered by month, as Iceberg will only read the relevant partitions.
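Under the hood, Iceberg's month transform maps each date to an ordinal month counted from the Unix epoch (1970-01), so all orders placed in the same calendar month share one partition value. A minimal Python sketch of that mapping, with made-up sample dates:

```python
from datetime import date

def month_transform(d: date) -> int:
    # Iceberg's month transform: number of months since 1970-01.
    return (d.year - 1970) * 12 + (d.month - 1)

orders = [date(2024, 1, 15), date(2024, 1, 30), date(2024, 2, 3)]
keys = [month_transform(d) for d in orders]
print(keys)  # [648, 648, 649] -- the two January orders share a partition
```

Because the transform is stored in table metadata, queries filter on order_date directly and Iceberg derives the partition values itself (hidden partitioning).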
Once the tables are created, you can view them in Dremio’s Datasets section:
You should see the customers, customers_partitioned, and orders tables listed.
Now let's look at how to insert data into these tables using SQL.
With our tables created, the next step is to populate them with data. Dremio's INSERT INTO command allows us to add data to Apache Iceberg tables, whether inserting individual rows or multiple records at once.
INSERT INTO Syntax
INSERT INTO <table_name> [(<column1>, <column2>, ...)]
VALUES (value1, value2, ...), (value1, value2, ...), ...;
- <table_name> and optional column list: Name the target table and, optionally, the columns to populate.
- VALUES: A list of values to insert. You can insert one or more rows by adding sets of values separated by commas.
Let's add a single row to the customers table.
INSERT INTO lakehouse.customers (id, first_name, last_name, age)
VALUES (1, 'John', 'Doe', 28);
In this example:
- We specify values for each column in the customers table: id, first_name, last_name, and age.
- This inserts a single record for a customer named John Doe, age 28.
To add multiple rows to a table in one command, list each row in the VALUES clause.
INSERT INTO lakehouse.customers (id, first_name, last_name, age)
VALUES
(2, 'Jane', 'Smith', 34),
(3, 'Alice', 'Johnson', 22),
(4, 'Bob', 'Williams', 45),
(5, 'Charlie', 'Brown', 30);
In this example, we insert four rows in a single statement, with each row's values enclosed in parentheses and separated by commas.
For partitioned tables, Dremio and Iceberg automatically manage the partitioning based on the table’s partitioning rules. Let’s add some data to the customers_partitioned table, which is partitioned by the first letter of last_name.
INSERT INTO lakehouse.customers_partitioned (id, first_name, last_name, age)
VALUES
(6, 'Emma', 'Anderson', 29),
(7, 'Frank', 'Baker', 35),
(8, 'Grace', 'Clark', 41);
You can also insert data into a table by selecting data from another table. This is particularly useful if you need to copy data or load data from a staging table.
INSERT INTO lakehouse.customers_partitioned (id, first_name, last_name, age)
SELECT id, first_name, last_name, age
FROM lakehouse.customers
WHERE age > 30;
In this example, we insert rows into customers_partitioned by selecting records from the customers table. Only customers older than 30 are inserted into customers_partitioned.
To confirm that data was successfully inserted, you can use a SELECT query to retrieve and view the data:
SELECT * FROM lakehouse.customers;
This command will display all rows in the customers table, allowing you to verify that your insertions were successful.
With INSERT INTO, you can populate your Iceberg tables with data, either by inserting individual rows, multiple records at once, or copying data from other tables. Next, let's explore how to query this data with SQL.
With data inserted into our tables, we can now use SQL to query and analyze it. Dremio supports various SQL features, including filtering, grouping, ordering, and even Iceberg’s unique time-travel capabilities.
SELECT Query Syntax
The SELECT command allows you to retrieve data from a table. Here's the basic syntax:
SELECT [ALL | DISTINCT] <columns>
FROM <table_name>
[WHERE <condition>]
[GROUP BY <expression>]
[ORDER BY <column> [DESC]]
[LIMIT <count>];
- ALL | DISTINCT: ALL returns all values, while DISTINCT eliminates duplicates. If omitted, ALL is used by default.
- WHERE: Filters records based on a condition.
- GROUP BY: Groups records with similar values, allowing aggregate functions like COUNT, SUM, and AVG.
- ORDER BY: Sorts results by one or more columns; add DESC for descending order.
- LIMIT: Restricts the number of rows returned.
To view all data in the customers table, use SELECT *:
SELECT * FROM lakehouse.customers;
This query retrieves every row and column in the customers table.
Use the WHERE clause to filter records based on a condition. For instance, let's retrieve all customers over the age of 30:
SELECT * FROM lakehouse.customers
WHERE age > 30;
This query returns only the rows where age is greater than 30.
The GROUP BY clause groups records based on a specified column, allowing you to calculate aggregates. For example, let’s count the number of customers by age:
SELECT age, COUNT(*) AS customer_count
FROM lakehouse.customers
GROUP BY age;
In this example, GROUP BY age groups rows that share the same age, and COUNT(*) returns the number of customers in each group, aliased as customer_count.
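The grouping logic can be pictured in plain Python (an illustrative sketch with made-up ages, not how Dremio executes the query): rows are bucketed by the GROUP BY key, then COUNT(*) tallies each bucket:

```python
from collections import Counter

# Sample 'age' values from a customers table (made-up data).
ages = [28, 34, 22, 45, 30, 34]

# GROUP BY age + COUNT(*) amounts to counting occurrences per key.
customer_count = Counter(ages)

print(sorted(customer_count.items()))
# [(22, 1), (28, 1), (30, 1), (34, 2), (45, 1)]
```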
You can sort query results by one or more columns. To get a list of customers ordered by age in descending order:
SELECT * FROM lakehouse.customers
ORDER BY age DESC;
This will display customers from the oldest to the youngest.
Use LIMIT to restrict the number of rows returned. This is useful for viewing a sample of your data.
SELECT * FROM lakehouse.customers
LIMIT 5;
This query will return only the first five rows in the customers table.
One of Iceberg’s powerful features is time-travel, which allows you to query historical versions of a table. You can specify a particular snapshot ID or timestamp to view data as it was at that moment.
Query by Snapshot ID:
SELECT * FROM lakehouse.customers AT SNAPSHOT '1234567890123456789';
Replace '1234567890123456789' with the actual snapshot ID.
Query by Timestamp:
SELECT * FROM lakehouse.customers AT TIMESTAMP '2024-01-01 00:00:00.000';
Replace '2024-01-01 00:00:00.000' with the desired timestamp. This lets you view the table as it existed at that specific time.
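Conceptually, a timestamp query resolves to the latest snapshot committed at or before the given time. The hypothetical Python sketch below (snapshot IDs and commit times are made up) shows the lookup Iceberg performs against a table's snapshot log:

```python
from datetime import datetime

# Hypothetical snapshot log: (commit_time, snapshot_id) in commit order.
snapshot_log = [
    (datetime(2024, 1, 1, 9, 0), 1001),
    (datetime(2024, 1, 2, 9, 0), 1002),
    (datetime(2024, 1, 3, 9, 0), 1003),
]

def snapshot_as_of(ts: datetime) -> int:
    # Pick the latest snapshot whose commit time is <= the requested time.
    eligible = [sid for commit, sid in snapshot_log if commit <= ts]
    if not eligible:
        raise ValueError("no snapshot exists at or before this timestamp")
    return eligible[-1]

print(snapshot_as_of(datetime(2024, 1, 2, 12, 0)))  # 1002
```

An AT TIMESTAMP query then reads the data files referenced by that snapshot, ignoring anything committed later.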
Window functions allow you to perform calculations across rows related to the current row within a specified window. For example, if we want to rank customers by age within groups, we can use RANK():
SELECT id, first_name, last_name, age,
RANK() OVER (ORDER BY age DESC) AS age_rank
FROM lakehouse.customers;
This query assigns a rank based on age, with the oldest customers ranked first.
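To check your understanding of RANK() semantics, here is an illustrative Python sketch (sample names and ages are made up): ties receive the same rank, and the next distinct value's rank skips ahead by the size of the tie group:

```python
def rank_desc(rows, key):
    # SQL RANK(): sort descending, give ties the same rank, and let
    # the next distinct value take its position-based rank (gaps appear).
    ordered = sorted(rows, key=key, reverse=True)
    ranked = []
    for i, row in enumerate(ordered):
        if i > 0 and key(row) == key(ordered[i - 1]):
            rank = ranked[-1][1]      # tie: reuse the previous rank
        else:
            rank = i + 1              # otherwise rank = row position
        ranked.append((row, rank))
    return ranked

customers = [("Bob", 45), ("Jane", 34), ("John", 28), ("Alice", 34)]
print(rank_desc(customers, key=lambda r: r[1]))
# [(('Bob', 45), 1), (('Jane', 34), 2), (('Alice', 34), 2), (('John', 28), 4)]
```

Note the gap after the tie: both 34-year-olds rank 2, and the next customer ranks 4, not 3 (DENSE_RANK() would give 3).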
To ensure your queries are correct, you can run them in Dremio’s SQL Runner and examine the results in the output pane. Dremio provides performance insights and query details, making it easy to optimize and validate your SQL queries.
With SELECT statements, you can retrieve, filter, group, and order data in Dremio, as well as take advantage of Iceberg’s time-travel capabilities. Next, we’ll look at how to update records in your tables using SQL.
In Dremio, you can use SQL to update existing records in Apache Iceberg tables, making it easy to modify data without rewriting entire datasets. The UPDATE command lets you change specific columns for rows that meet certain conditions.
UPDATE Syntax
UPDATE <table_name>
SET <column1> = <value1>, <column2> = <value2>, ...
[WHERE <condition>];
- SET: Specifies the columns and new values to assign.
- WHERE: An optional clause to filter the rows that should be updated. Without WHERE, all rows in the table will be updated.
Suppose we want to update the age of a specific customer. We can use the WHERE clause to target the correct row:
UPDATE lakehouse.customers
SET age = 29
WHERE id = 1;
In this example:
- We update the age of the customer with id = 1 to 29.
- The WHERE clause ensures that only rows with id = 1 are affected.
You can update multiple columns in a single UPDATE command. Let's change both the first_name and last_name of a customer:
UPDATE lakehouse.customers
SET first_name = 'Jonathan', last_name = 'Doe-Smith'
WHERE id = 1;
Here:
- We update both first_name and last_name for the customer with id = 1.
- Both changes are applied in a single statement to the rows matching the WHERE condition.
The WHERE clause allows you to apply updates based on specific conditions. For instance, let's increase the age of all customers under 25 by 1 year:
UPDATE lakehouse.customers
SET age = age + 1
WHERE age < 25;
In this example, every customer younger than 25 has their age incremented by 1. This approach is useful for performing bulk updates based on a condition.
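Row by row, the conditional update behaves like this illustrative Python sketch (the sample data is made up; under the hood, Dremio rewrites the affected Iceberg data files rather than mutating rows in place):

```python
# Made-up sample rows from the customers table.
customers = [
    {"id": 3, "first_name": "Alice", "age": 22},
    {"id": 5, "first_name": "Charlie", "age": 30},
]

# UPDATE lakehouse.customers SET age = age + 1 WHERE age < 25
for row in customers:
    if row["age"] < 25:       # WHERE age < 25
        row["age"] += 1       # SET age = age + 1

print([row["age"] for row in customers])  # [23, 30]
```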
If you’re using Nessie to manage versions, you can update records within a specific branch. This allows you to make updates in an isolated environment, which you can later merge into the main branch.
First, you'd need to create a new branch:
CREATE BRANCH development IN lakehouse;
Then you can update records in the branch:
UPDATE lakehouse.customers
AT BRANCH 'development'
SET age = 30
WHERE id = 3;
In this case, the update is applied only on the development branch; the main branch is unaffected. Once you've verified the change, you can merge the branch back into main:
MERGE BRANCH development INTO main IN lakehouse;
To confirm your updates, you can query the table to view the modified records:
SELECT * FROM lakehouse.customers WHERE id = 1;
This query will display the updated row, allowing you to verify that the changes were applied successfully.
Remember to include the AT BRANCH clause in your UPDATE command if you want to limit changes to a specific branch.
Using the UPDATE command, you can easily modify data in your Apache Iceberg tables in Dremio. Whether updating single rows or multiple records based on conditions, Dremio's SQL capabilities make data management flexible and efficient. In the next section, we'll explore how to alter a table's structure using SQL.
As your data needs evolve, you may need to modify the structure of an Apache Iceberg table. Dremio's ALTER TABLE command provides flexibility to add, drop, or modify columns in existing tables, allowing your schema to evolve without significant disruptions.
ALTER TABLE Syntax
ALTER TABLE <table_name>
[ ADD COLUMNS ( <column_name> <data_type> [, ...] ) ]
[ DROP COLUMN <column_name> ]
[ ALTER COLUMN <column_name> SET MASKING POLICY <policy_name> ]
[ MODIFY COLUMN <column_name> <new_data_type> ];
- ADD COLUMNS: Adds new columns to the table.
- DROP COLUMN: Removes a specified column from the table.
- ALTER COLUMN: Allows you to set a masking policy for data security.
- MODIFY COLUMN: Changes the data type of an existing column.
To add a new column to an existing table, use the ADD COLUMNS clause. Let's add an email column to the customers table.
ALTER TABLE lakehouse.customers
ADD COLUMNS (email VARCHAR);
In this example:
- We add an email column of type VARCHAR to store customer email addresses.
- Existing rows will show NULL as the default value in the new email column until data is populated.
If a column is no longer needed, you can remove it using DROP COLUMN. Let's remove the age column from the customers table.
ALTER TABLE lakehouse.customers
DROP COLUMN age;
Here, the age column is dropped from the customers table. Because Iceberg records schema changes in table metadata, the column is removed without rewriting existing data files.
To change the data type of an existing column, use MODIFY COLUMN. For example, let's change the id column from INT to BIGINT to allow larger values.
ALTER TABLE lakehouse.customers
MODIFY COLUMN id BIGINT;
In this example:
- We change the id column's data type to BIGINT, which can store larger values than INT.
- Iceberg supports safe type promotions (such as INT to BIGINT) without rewriting existing data.
Data masking can enhance data security by obscuring sensitive information. In Dremio, you can apply a masking policy to a column, making sensitive data less accessible to unauthorized users.
ALTER TABLE lakehouse.customers
ALTER COLUMN email
SET MASKING POLICY mask_email (email);
In this case, the mask_email policy is applied to the email column, so unauthorized users see masked values instead of raw email addresses (this assumes a masking policy named mask_email has already been defined in Dremio).
For Iceberg tables, you can adjust partitioning without rewriting the table. Let's add a partition field to the customers table to partition data by the first letter of last_name.
ALTER TABLE lakehouse.customers
ADD PARTITION FIELD truncate(1, last_name);
Here, we add a partition field based on the first character of last_name, making queries more efficient when filtering by last_name.
After altering a table, you can verify the changes by checking the schema in Dremio's Datasets section or by running a SELECT query to observe the modified structure:
SELECT * FROM lakehouse.customers;
Using the ALTER TABLE command in Dremio, you can evolve the structure of your Apache Iceberg tables by adding, modifying, or removing columns, as well as updating partitioning strategies. In the next section, we’ll look at how to delete records from tables using SQL.
Deleting specific records from an Apache Iceberg table in Dremio can be done using the DELETE command. This allows you to remove rows based on conditions, keeping your data relevant and up-to-date without needing to rewrite the entire dataset.
DELETE Syntax
DELETE FROM <table_name>
[WHERE <condition>];
Suppose we want to delete records of customers under the age of 18. We can use the WHERE clause to filter these rows and remove them from the customers table.
DELETE FROM lakehouse.customers
WHERE age < 18;
In this example:
- All customers younger than 18 are removed from the table.
- The WHERE clause ensures that only specific records are affected by the deletion.
If you need to clear all data from a table but keep the table structure intact, simply omit the WHERE clause.
DELETE FROM lakehouse.customers;
This command removes every row from the customers table while keeping the table definition and schema intact.
When using Nessie for versioned data management, you can delete records in an isolated branch. This allows for safe experimentation without affecting the main data.
DELETE FROM lakehouse.customers
AT BRANCH development
WHERE age > 60;
In this example, customers older than 60 are deleted only on the development branch; the main branch remains unchanged until the branch is merged.
To confirm that records were successfully deleted, run a SELECT query on the table:
SELECT * FROM lakehouse.customers;
This command will display the remaining records, allowing you to verify that the desired rows were removed.
The DELETE command in Dremio provides a straightforward way to remove unwanted data from your Apache Iceberg tables. This completes the basics of SQL operations with Apache Iceberg and Dremio, empowering you to handle data from creation to deletion with ease.
We explored the essentials of SQL operations using Apache Iceberg and Dremio. By combining Dremio’s powerful query engine with Apache Iceberg’s robust data management capabilities, you can efficiently handle large datasets, support schema evolution, and take advantage of advanced features like time-travel and branching. Here’s a quick recap of what we covered:
- Using the CREATE TABLE command, we created Apache Iceberg tables, including partitioned tables for optimized performance.
- We populated tables with the INSERT INTO command, demonstrating single and batch inserts.
- With SELECT queries, we retrieved data, applied filters, grouped results, and explored Iceberg's time-travel capabilities.
- We used the UPDATE command to modify specific records based on conditions, showing how to evolve data as needs change.
- With ALTER TABLE, we modified the structure of our tables, adding, dropping, and modifying columns as our data needs evolved.
- We removed records with the DELETE command, enabling record removal based on conditions and managing data cleanly.
With these SQL basics under your belt, you're ready to continue expanding your skills with Apache Iceberg and Dremio.
By mastering these core SQL operations, you’re well-prepared to build, maintain, and analyze data in a modern data lakehouse architecture. Whether you’re managing structured or unstructured data, Dremio and Apache Iceberg offer the tools you need for efficient, flexible, and high-performance data workflows.