Introduction

I like to spend time on LinkedIn reading through posts from companies about news, new releases, funding, new products, etc. During a recent perusal, I saw something about “DuckDB”, and as much as I pay attention to the data space, I hadn’t heard of this one, so I thought I’d investigate and answer the question “What the heck is DuckDB?”.

DuckDB Overview

DuckDB itself is an MIT-licensed open source project whose code is hosted on GitHub. DuckDB Labs is a commercial company formed by the creators of DuckDB in July 2021 to provide support, custom extensions, and even custom versions of the product as a way to monetize it. This model reminds me of the early days of open source monetization and is one I like.

DuckDB is briefly described as SQLite for analytic workloads. While SQLite is an embeddable, row-based, B-tree-indexed data store well suited for OLTP workloads, DuckDB is an embeddable, column-based data store that uses vectorized processing to optimize OLAP workloads. You could also think of it as an embedded Amazon Redshift, or a mutant offspring of SQLite and Redshift. Some of the features are:

- Simple installation
- Single-file storage format
- No server
- Fast processing
- Language library integrations
- Not reliant on any external config files or settings
- Programmatic SQL API
- Fully ACID compliant
- WASM (WebAssembly) version available

OLTP                                                 | OLAP
For your operational workloads                       | For your analytic workloads
Shorter queries                                      | Longer queries for complex questions
Tables are more highly normalized                    | Tables are de-normalized
Typically implemented as row-oriented data stores    | Typically implemented as column-oriented data stores

Testing

I decided to use their very clever WASM web-based shell to try out querying some Parquet files, and I grabbed a few sample files to play with. I started with the “.files add” command to load up the Parquet file.

Then I did some basic SQL to check it out:

duckdb> select count(*) from userdata1.parquet;
┌──────────────┐
│ count_star() │
╞══════════════╡
│         1000 │
└──────────────┘
Elapsed: 1 ms

duckdb> select first_name, last_name, email from userdata1.parquet where country = 'Nigeria';
┌────────────┬───────────┬────────────────────────────┐
│ first_name ┆ last_name ┆ email                      │
╞════════════╪═══════════╪════════════════════════════╡
│ Emily      ┆ Stewart   ┆ estewart9@opensource.org   │
│ Annie      ┆ Torres    ┆ atorrest@ning.com          │
│ William    ┆ Green     ┆ wgreen63@phpbb.com         │
│ Jack       ┆ Medina    ┆ jmedina7y@fda.gov          │
│ Jeremy     ┆ Bennett   ┆ jbennettck@wikipedia.org   │
│ Carlos     ┆ Day       ┆ cdaycn@gravatar.com        │
│ Ryan       ┆ Mills     ┆ rmillsgj@angelfire.com     │
│ Betty      ┆ Gibson    ┆ bgibsonka@tamu.edu         │
│ Wanda      ┆ Stanley   ┆ wstanleymz@sourceforge.net │
│ Evelyn     ┆ Spencer   ┆ espencerpi@ted.com         │
│ George     ┆ Howard    ┆ ghowardqh@mapquest.com     │
└────────────┴───────────┴────────────────────────────┘
Elapsed: 2 ms

You can even do an ‘explain’:

duckdb> explain select first_name, last_name, email from userdata1.parquet where country = 'Nigeria';
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         first_name        │
│         last_name         │
│           email           │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│        PARQUET_SCAN       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          country          │
│         first_name        │
│         last_name         │
│           email           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│  Filters: country=Nigeria │
│   AND country IS NOT NULL │
└───────────────────────────┘

A very nice feature of DuckDB is that if you are working with Python, for example, you can use it as a library: once the duckdb package is installed, you add “import duckdb” to your Python script. It then runs inside your Python process, so it feels very integrated into your program, unlike working with something like MySQL or Postgres. So, using our userdata1.parquet file, we could do something like this:

import duckdb

# Open (or create) a persistent database file
myconnector = duckdb.connect('myduckdb.duckdb')
cursor = myconnector.cursor()

# Create a table whose columns match the Parquet file
cursor.execute("""
CREATE TABLE userdata(
   registration_dttm  date,
   id                 int,
   first_name         varchar,
   last_name          varchar,
   email              varchar,
   gender             varchar,
   ip_address         varchar,
   cc                 varchar,
   country            varchar,
   birthdate          varchar,
   salary             float,
   title              varchar,
   comments           varchar  -- assumed: free-text column
)
""")

# Load the Parquet file into the table
cursor.execute("COPY userdata FROM 'userdata1.parquet' (FORMAT PARQUET)")

# Run a simple count and print the result
print(cursor.execute('select count(*) from userdata').fetchall())
cursor.close()
myconnector.close()

In the code snippet shown above, we connect to the ‘myduckdb.duckdb’ database, create a table that matches our Parquet file, copy the data into it, and then perform a simple count query.

Summary

This is a really cool project. While I’ve been aware of the advantages of columnar data stores for about eight years, initially because of Sisense, I only started working with them more extensively in the past year. I’m especially excited by their WASM implementation and the clever things they did with Arrow as a data protocol for the data import and query results. Not only is it a great technical example of WASM, but it has some great utility as well.

Then, as I was wrapping this up, I ran across a YouTube video of DuckCon, which was the first DuckDB user group meeting from April 2022, that you might want to check out. Lots of great things on the roadmap.

As a final note, I have nothing to do with this project or the people behind it. They don’t even know I wrote this, so if I seem like a bit of a fanboy, it’s because I’m legitimately very keen on the project.
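
A postscript for anyone new to the row-versus-column distinction from the overview: the toy sketch below uses plain Python lists as a stand-in for storage, so it says nothing about how DuckDB or SQLite actually lay out data on disk. The sample records and function names are invented for illustration. It only shows why an analytic query that needs one column has less to touch when values are stored column by column.

```python
# Toy illustration only: plain Python structures stand in for on-disk storage.

# Row-oriented layout: each record keeps all of its fields together,
# which suits OLTP-style "fetch or update one record" access.
rows = [
    {"id": 1, "country": "Nigeria", "salary": 100.0},
    {"id": 2, "country": "Canada", "salary": 250.5},
    {"id": 3, "country": "Nigeria", "salary": 300.0},
]

# Column-oriented layout: each column keeps all of its values together,
# which suits OLAP-style "aggregate one column over many records" access.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_salary_row_store(records):
    """Walk every record and pull a single field out of each one."""
    return sum(record["salary"] for record in records)


def total_salary_column_store(cols):
    """Read only the one column the query needs."""
    return sum(cols["salary"])


if __name__ == "__main__":
    print(total_salary_row_store(rows))
    print(total_salary_column_store(columns))
```

Both functions return the same total; the difference is how much unrelated data each one has to step over to get there.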
