A new version of ClickHouse was released in December 2023. Let’s see what exciting new features are inside and dive into one of them: the new refreshable materialized views!
This release contains 21 new features, 18 performance optimizations, and 37 bug fixes.
Here are several exciting features I would like you to know about:
You can read the complete changelog here.
In the meantime, I’ll show an example of using refreshable materialized views.
ClickHouse periodically runs the corresponding query and stores its result in a table, atomically replacing the table's previous contents. This differs fundamentally from regular, non-refreshable materialized views: instead of an incremental update, the data from the source tables is entirely recalculated and rewritten on every refresh.
Of course, there are trade-offs: the view is no longer updated online, and that is the price of supporting arbitrary aggregation queries. Non-refreshable materialized views are limited in their use of aggregate functions, since those apply only to the incremental data. Refreshable views place no restrictions on the SELECT query: table functions (e.g. url()), views, UNION, and JOIN are all allowed.
Since the feature is experimental, you have to enable it with a setting:
SET allow_experimental_refreshable_materialized_view = 1
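Since the SELECT query is unrestricted, even a table function can serve as the source. A minimal sketch, assuming a hypothetical CSV endpoint (the URL and view name are placeholders, not from the release notes):

```sql
-- Minimal sketch: a refreshable view over the url() table function.
-- The endpoint is a placeholder; any URL serving CSV with a header row would do.
CREATE MATERIALIZED VIEW IF NOT EXISTS demo_rates
REFRESH EVERY 1 HOUR
ENGINE = Memory
AS SELECT *
FROM url('https://example.com/rates.csv', CSVWithNames)
```

Every hour the query is re-executed and the Memory table is atomically swapped to the fresh result.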
Let’s create a somewhat synthetic example: a crawler traverses URLs on a single domain using DFS and stores the request headers, response headers, and page content in a plain table.
CREATE TABLE IF NOT EXISTS crawler.pages (
    url String,
    req_headers Map(String, String),
    resp_headers Map(String, String),
    resp_code Int,
    content String,
    links Nested (
        uri String,
        titles Array(String)
    )
)
ENGINE = MergeTree
PRIMARY KEY url
ORDER BY url
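To make the pipeline below reproducible without a crawler, here is a hypothetical sample row (all values are synthetic). With the default flatten_nested = 1, the Nested columns are inserted as parallel arrays:

```sql
-- Synthetic sample row; all values are made up for illustration.
INSERT INTO crawler.pages (url, req_headers, resp_headers, resp_code, content, links.uri, links.titles)
VALUES (
    'https://example.com/',
    map('user-agent', 'Testbot/1.0'),
    map('etag', 'W/"abc123"', 'date', 'Sun, 31 Dec 2023 14:20:54 GMT'),
    200,
    '<html>...</html>',
    ['https://example.com/about'],
    [['About']]
);
```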
This new feature allows me to build handy pipelines. The following example is pretty synthetic; however, it demonstrates how refreshable views can be chained into a multi-step pipeline.
The original (“source”) table may contain duplicate rows for any page, so the first processing step selects the unique pages that have a non-empty etag response header and a response code of 200, while applying transformations such as domain extraction and crawl-date parsing.
CREATE MATERIALIZED VIEW IF NOT EXISTS crawler.uniqpages
REFRESH EVERY 2 MINUTE -- the interval unit can be SECOND|MINUTE|HOUR|DAY|WEEK|MONTH|YEAR
RANDOMIZE FOR 1 MINUTE
ENGINE = Memory
AS SELECT DISTINCT
    domainWithoutWWW(url) as domain, -- built-in function: extract the domain from the url
    path(url) as uri, -- built-in function: extract the path from the url
    url as url,
    resp_headers['etag'] as etag,
    parseDateTime32BestEffort(resp_headers['date']) as crawled_at, -- datetime parsing
    req_headers['user-agent'] as user_agent,
    resp_code as resp_code,
    links.uri as nei_uris,
    links.titles as nei_titles
FROM
    crawler.pages
WHERE
    resp_headers['etag'] <> '' AND resp_code = 200
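Instead of waiting for the two-minute schedule, a refresh can also be triggered by hand. A sketch, assuming the view above has been created; the statement belongs to the same experimental feature set:

```sql
-- Force an immediate refresh instead of waiting for the schedule
SYSTEM REFRESH VIEW crawler.uniqpages;
```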
The second step precalculates per-domain stats over the unique valid pages. This step depends on the completion of the uniqpages refresh; without the DEPENDS ON clause, both refreshes would start simultaneously.
CREATE MATERIALIZED VIEW IF NOT EXISTS crawler.stats
REFRESH EVERY 2 MINUTE
DEPENDS ON crawler.uniqpages -- wait for the uniqpages refresh, so the two run in sequence
ENGINE = Memory
AS SELECT
    domain,
    count(1) as qnt
FROM
    crawler.uniqpages
GROUP BY
    1
ORDER BY
    ALL
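The refresh schedule of each view can be observed through a system table that ships with this feature; a hedged sketch of a status query (column names per the system.view_refreshes table):

```sql
-- Inspect refresh status: when each view last ran and when the next run is due
SELECT database, view, status, last_refresh_time, next_refresh_time
FROM system.view_refreshes
WHERE database = 'crawler';
```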
Check the uniqpages view for any record to verify the results.
dd4a5ae26651 :) select * from uniqpages limit 1 format Vertical
SELECT *
FROM uniqpages
LIMIT 1
FORMAT Vertical
Query id: ba19be7b-0891-4c07-a707-62e5e3eaa3c4
Row 1:
──────
domain: levelup.gitconnected.com
uri: /%EF%B8%8F-how-to-deploy-a-static-website-or-a-server-side-rendered-app-with-aws-amplify-hosting-806b4d620458
url: https://levelup.gitconnected.com/%EF%B8%8F-how-to-deploy-a-static-website-or-a-server-side-rendered-app-with-aws-amplify-hosting-806b4d620458?source=---------4-----------------------
etag: W/"1c8ed-8WUFxf2KDh2sBbP4zrGTJCdxo4I"
crawled_at: 2023-12-31 14:20:54
user_agent: Mozilla/5.0 (X11; Linux x86_64; Testbot/1.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
resp_code: 200
nei_uris: ['https://levelup.gitconnected.com/?source=post_page-----806b4d620458--------------------------------']
nei_titles: [['Level Up Coding','Level Up Coding']]
1 row in set. Elapsed: 0.001 sec. Processed 22.91 thousand rows, 209.44 MB (30.17 million rows/s., 275.83 GB/s.)
Peak memory usage: 15.02 KiB.
Let’s check the stats view:
dd4a5ae26651 :) select domain, qnt from stats order by domain
SELECT
domain,
qnt
FROM stats
ORDER BY domain ASC
Query id: 9a2975ea-b52a-44ec-acae-698bb51f1973
┌─domain───────────────────┬───qnt─┐
│ habr.com │ 12941 │
│ levelup.gitconnected.com │ 9295 │
│ w3.org │ 665 │
│ wikipedia.org │ 6 │
└──────────────────────────┴───────┘
4 rows in set. Elapsed: 0.001 sec.
… and a couple of minutes later:
dd4a5ae26651 :) select domain, qnt from stats order by domain
SELECT
domain,
qnt
FROM stats
ORDER BY domain ASC
Query id: 9a2975ea-b52a-44ec-acae-698bb51f1973
┌─domain───────────────────┬───qnt─┐
│ habr.com │ 12947 │
│ levelup.gitconnected.com │ 9316 │
│ w3.org │ 665 │
│ wikipedia.org │ 6 │
└──────────────────────────┴───────┘
4 rows in set. Elapsed: 0.001 sec.
A refreshable materialized view is quite handy and the most interesting feature of this release. It lets me build useful ETL pipelines in a very simple way, using nothing but SQL and letting ClickHouse do the routine job for me.
Let me know in the comments which of the new features excites you the most!