A new version of ClickHouse was released in December 2023. Let’s see what exciting new features are inside and dive into one of them: the new refreshable materialized views!
This release contains 21 new features, 18 performance optimizations, and 37 bug fixes.
Here are several exciting features I would like you to know about:
You can read the complete changelog here.
In the meantime, I’ll show an example of using refreshable materialized views.
ClickHouse periodically runs the corresponding query and stores its result in a table, atomically replacing the table's previous contents. This differs fundamentally from regular, non-refreshable materialized views: instead of an incremental update, the data from the source tables is entirely recalculated and rewritten on every refresh.
Of course, there are trade-offs: the view is no longer updated online, and that is the price of supporting arbitrary aggregation queries. Non-refreshable materialized views are limited in their use of aggregate functions, since those apply only to the incremental data. Refreshable views place no restrictions on the SELECT query: table functions (e.g. url()), views, UNION, and JOIN are all allowed.
Since the feature is experimental, you have to enable it with a setting:
SET allow_experimental_refreshable_materialized_view = 1
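Since the SELECT query is unrestricted, even a table function can serve as the source. A minimal sketch, assuming a hypothetical CSV endpoint (the URL and view name are placeholders, not from the release notes):

```sql
-- Minimal sketch: a refreshable view over the url() table function.
-- The endpoint is a placeholder; any URL serving CSV with a header row would do.
CREATE MATERIALIZED VIEW IF NOT EXISTS demo_rates
REFRESH EVERY 1 HOUR
ENGINE = Memory
AS SELECT *
FROM url('https://example.com/rates.csv', CSVWithNames)
```

Every hour the query is re-executed and the Memory table is atomically swapped to the fresh result.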
Let’s create a somewhat synthetic example: a crawler traverses URLs on a single domain using DFS and stores the request headers, response headers, and page content in a plain table.
CREATE TABLE IF NOT EXISTS crawler.pages (
    url String,
    req_headers Map(String, String),
    resp_headers Map(String, String),
    resp_code Int,
    content String,
    links Nested (
        uri String,
        titles Array(String)
    )
)
ENGINE = MergeTree
PRIMARY KEY url
ORDER BY url
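To make the pipeline below reproducible without a crawler, here is a hypothetical sample row (all values are synthetic). With the default flatten_nested = 1, the Nested columns are inserted as parallel arrays:

```sql
-- Synthetic sample row; all values are made up for illustration.
INSERT INTO crawler.pages (url, req_headers, resp_headers, resp_code, content, links.uri, links.titles)
VALUES (
    'https://example.com/',
    map('user-agent', 'Testbot/1.0'),
    map('etag', 'W/"abc123"', 'date', 'Sun, 31 Dec 2023 14:20:54 GMT'),
    200,
    '<html>...</html>',
    ['https://example.com/about'],
    [['About']]
);
```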
This new feature allows me to build handy pipelines. The following example is pretty synthetic; however, it demonstrates how refreshable views can be chained into a multi-step pipeline.
The original (“source”) table may contain duplicate rows for any page, so the first processing step selects the unique pages that have a non-empty etag response header and a response code of 200, while applying transformations such as domain extraction and crawl-date parsing.
CREATE MATERIALIZED VIEW IF NOT EXISTS crawler.uniqpages
REFRESH EVERY 2 MINUTE -- the interval unit can be SECOND|MINUTE|HOUR|DAY|WEEK|MONTH|YEAR
RANDOMIZE FOR 1 MINUTE
ENGINE = Memory
AS SELECT DISTINCT
    domainWithoutWWW(url) as domain, -- built-in function: extract the domain from the url
    path(url) as uri, -- built-in function: extract the path from the url
    url as url,
    resp_headers['etag'] as etag,
    parseDateTime32BestEffort(resp_headers['date']) as crawled_at, -- datetime parsing
    req_headers['user-agent'] as user_agent,
    resp_code as resp_code,
    links.uri as nei_uris,
    links.titles as nei_titles
FROM
    crawler.pages
WHERE
    resp_headers['etag'] <> '' AND resp_code = 200
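Instead of waiting for the two-minute schedule, a refresh can also be triggered by hand. A sketch, assuming the view above has been created; the statement belongs to the same experimental feature set:

```sql
-- Force an immediate refresh instead of waiting for the schedule
SYSTEM REFRESH VIEW crawler.uniqpages;
```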
The second step precalculates per-domain stats over the unique valid pages. This step depends on the completion of the uniqpages refresh; without the DEPENDS ON clause, both refreshes would start simultaneously.
CREATE MATERIALIZED VIEW IF NOT EXISTS crawler.stats
REFRESH EVERY 2 MINUTE
DEPENDS ON crawler.uniqpages -- wait for the uniqpages refresh, so the two run in sequence
ENGINE = Memory
AS SELECT
    domain,
    count(1) as qnt
FROM
    crawler.uniqpages
GROUP BY
    1
ORDER BY
    ALL
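The refresh schedule of each view can be observed through a system table that ships with this feature; a hedged sketch of a status query (column names per the system.view_refreshes table):

```sql
-- Inspect refresh status: when each view last ran and when the next run is due
SELECT database, view, status, last_refresh_time, next_refresh_time
FROM system.view_refreshes
WHERE database = 'crawler';
```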
Check the uniqpages view for any record to verify the results.
dd4a5ae26651 :) select * from uniqpages limit 1 format Vertical
SELECT *
FROM uniqpages
LIMIT 1
FORMAT Vertical
Query id: ba19be7b-0891-4c07-a707-62e5e3eaa3c4
Row 1:
──────
domain: levelup.gitconnected.com
uri: /%EF%B8%8F-how-to-deploy-a-static-website-or-a-server-side-rendered-app-with-aws-amplify-hosting-806b4d620458
url: https://levelup.gitconnected.com/%EF%B8%8F-how-to-deploy-a-static-website-or-a-server-side-rendered-app-with-aws-amplify-hosting-806b4d620458?source=---------4-----------------------
etag: W/"1c8ed-8WUFxf2KDh2sBbP4zrGTJCdxo4I"
crawled_at: 2023-12-31 14:20:54
user_agent: Mozilla/5.0 (X11; Linux x86_64; Testbot/1.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
resp_code: 200
nei_uris: ['https://levelup.gitconnected.com/?source=post_page-----806b4d620458--------------------------------']
nei_titles: [['Level Up Coding','Level Up Coding']]
1 row in set. Elapsed: 0.001 sec. Processed 22.91 thousand rows, 209.44 MB (30.17 million rows/s., 275.83 GB/s.)
Peak memory usage: 15.02 KiB.
Let’s check the stats view:
dd4a5ae26651 :) select domain, qnt from stats order by domain
SELECT
domain,
qnt
FROM stats
ORDER BY domain ASC
Query id: 9a2975ea-b52a-44ec-acae-698bb51f1973
┌─domain───────────────────┬───qnt─┐
│ habr.com │ 12941 │
│ levelup.gitconnected.com │ 9295 │
│ w3.org │ 665 │
│ wikipedia.org │ 6 │
└──────────────────────────┴───────┘
4 rows in set. Elapsed: 0.001 sec.
… and a couple of minutes later:
dd4a5ae26651 :) select domain, qnt from stats order by domain
SELECT
domain,
qnt
FROM stats
ORDER BY domain ASC
Query id: 9a2975ea-b52a-44ec-acae-698bb51f1973
┌─domain───────────────────┬───qnt─┐
│ habr.com │ 12947 │
│ levelup.gitconnected.com │ 9316 │
│ w3.org │ 665 │
│ wikipedia.org │ 6 │
└──────────────────────────┴───────┘
4 rows in set. Elapsed: 0.001 sec.
A refreshable materialized view is quite handy and the most interesting feature of this release. It lets me build useful ETL pipelines in a very simple way, using nothing but SQL and letting ClickHouse do the routine job for me.
Let me know in the comments which of the new features excites you the most!