# ClickHouse New Year Release

A new version of ClickHouse was released in December 2023. Let's see what exciting new features are inside and dive into one of them: the new refreshable materialized views!

## What is on the board?

This release contains 21 new features, 18 performance optimizations, and 37 bug fixes. Here are several exciting features I would like you to know about:

- **New refreshable materialized views.** Crontab-like refreshing of views.
- **New `PASTE JOIN` clause.** A handy 1-to-1, row-by-row join of datasets.
- **New `ORDER BY ALL` clause.** Syntactic sugar to order a query by every column from the `SELECT` statement.
- **New mutation command `ALTER TABLE <table> APPLY DELETED MASK`.** It applies the mask written by a lightweight delete to remove rows marked as deleted from the disk.
- **Ability to refer to an `ALIAS` column in index (non-primary-key) definitions.**

You can read the complete changelog here. In the meantime, I'll show an example of using the new materialized views.

## Refreshable Materialized Views

ClickHouse periodically runs the corresponding query and stores its result in a table, atomically replacing the table's previous contents. This differs fundamentally from regular, non-refreshable materialized views: it is not incremental; the data from the source tables is entirely recalculated and rewritten on every refresh.

Of course, there are trade-offs: the view is no longer updated online, and that is the cost of unrestricted aggregation queries. Non-refreshable materialized views are limited in their use of aggregate functions, since the query applies only to the incremental data. Here, there are no restrictions on the `SELECT` query: table functions (e.g. `url()`), views, `UNION`, and `JOIN` are all allowed.

Since the feature is experimental, you have to enable it:

```sql
SET allow_experimental_refreshable_materialized_view = 1
```

Let's build a somewhat synthetic example: a crawler traverses URLs on a single domain using DFS and stores request headers, response headers, and content in a plain table.
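Two of the smaller additions from the list above are easy to show in a few lines. The queries below are my own illustrative sketches, not taken from the release notes: `PASTE JOIN` stitches two result sets together positionally (row N with row N), and `ORDER BY ALL` sorts by every column of the `SELECT` list.

```sql
-- PASTE JOIN: positional 1-to-1 join, first row with first row, and so on
SELECT *
FROM (SELECT number AS id FROM numbers(3))
PASTE JOIN (SELECT number * 10 AS score FROM numbers(3));
-- expected pairs: (0, 0), (1, 10), (2, 20)

-- ORDER BY ALL: shorthand for ordering by every selected column
SELECT number % 2 AS parity, number AS n
FROM numbers(5)
ORDER BY ALL; -- equivalent to ORDER BY parity, n
```

Note that `PASTE JOIN` makes sense only when you control the order of both sides, as in the subqueries above; there is no join key to align the rows for you.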
```sql
CREATE TABLE IF NOT EXISTS crawler.pages
(
    url String,
    req_headers Map(String, String),
    resp_headers Map(String, String),
    resp_code Int,
    content String,
    links Nested
    (
        uri String,
        titles Array(String)
    )
)
ENGINE = MergeTree
PRIMARY KEY url
ORDER BY url
```

This new feature allows me to build handy pipelines. The following example is pretty synthetic; however, it demonstrates the approach.

The original ("source") table may contain duplicates for any page, so the first step of data processing is selecting the unique pages, with the additional conditions of a non-empty `etag` response header and response code 200, and with transformations such as domain extraction and crawl-date parsing:

```sql
CREATE MATERIALIZED VIEW IF NOT EXISTS crawler.uniqpages
REFRESH EVERY 2 MINUTE -- the refresh unit can be SECOND|MINUTE|HOUR|DAY|WEEK|MONTH|YEAR
RANDOMIZE FOR 1 MINUTE
ENGINE = Memory
AS SELECT DISTINCT
    domainWithoutWWW(url) AS domain, -- extract the domain from the url (built-in function)
    path(url) AS uri,                -- extract the path from the url (built-in function)
    url AS url,
    resp_headers['etag'] AS etag,
    parseDateTime32BestEffort(resp_headers['date']) AS crawled_at, -- datetime parsing
    req_headers['user-agent'] AS user_agent,
    resp_code AS resp_code,
    links.uri AS nei_uris,
    links.titles AS nei_titles
FROM crawler.pages
WHERE resp_headers['etag'] <> '' AND resp_code = 200
```

The second step precalculates per-domain stats of the unique valid pages. This step depends on the completion of the `uniqpages` refresh; without that dependency, both refreshes would start simultaneously:

```sql
CREATE MATERIALIZED VIEW IF NOT EXISTS crawler.stats
REFRESH EVERY 2 MINUTE
DEPENDS ON crawler.uniqpages -- sync with the uniqpages view: this ensures the refreshes run in sequence
ENGINE = Memory
AS SELECT DISTINCT
    domain,
    count(1) AS qnt
FROM crawler.uniqpages
GROUP BY 1
ORDER BY ALL
```

Check the `uniqpages` view for any record to verify the results.
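Because the views refresh on their own schedule, it is useful to see when a refresh last ran before querying them. Here is a sketch, assuming the `system.view_refreshes` table and the `SYSTEM REFRESH VIEW` statement that ship alongside this feature; the exact column names come from the 23.12 documentation and may change while the feature is experimental, so check `DESCRIBE system.view_refreshes` on your version:

```sql
-- Inspect the refresh state of the views defined above
SELECT view, status, last_refresh_time, next_refresh_time
FROM system.view_refreshes
WHERE database = 'crawler';

-- Trigger a refresh immediately instead of waiting for the schedule
SYSTEM REFRESH VIEW crawler.stats;
```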
```
dd4a5ae26651 :) select * from uniqpages limit 1 format Vertical

SELECT *
FROM uniqpages
LIMIT 1
FORMAT Vertical

Query id: ba19be7b-0891-4c07-a707-62e5e3eaa3c4

Row 1:
──────
domain:     levelup.gitconnected.com
uri:        /%EF%B8%8F-how-to-deploy-a-static-website-or-a-server-side-rendered-app-with-aws-amplify-hosting-806b4d620458
url:        https://levelup.gitconnected.com/%EF%B8%8F-how-to-deploy-a-static-website-or-a-server-side-rendered-app-with-aws-amplify-hosting-806b4d620458?source=---------4-----------------------
etag:       W/"1c8ed-8WUFxf2KDh2sBbP4zrGTJCdxo4I"
crawled_at: 2023-12-31 14:20:54
user_agent: Mozilla/5.0 (X11; Linux x86_64; Testbot/1.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
resp_code:  200
nei_uris:   ['https://levelup.gitconnected.com/?source=post_page-----806b4d620458--------------------------------']
nei_titles: [['Level Up Coding','Level Up Coding']]

1 row in set. Elapsed: 0.001 sec. Processed 22.91 thousand rows, 209.44 MB (30.17 million rows/s., 275.83 GB/s.)
Peak memory usage: 15.02 KiB.
```

Let's check the `stats` view:

```
dd4a5ae26651 :) select domain, qnt from stats order by domain

SELECT domain, qnt
FROM stats
ORDER BY domain ASC

Query id: 9a2975ea-b52a-44ec-acae-698bb51f1973

┌─domain───────────────────┬───qnt─┐
│ habr.com                 │ 12941 │
│ levelup.gitconnected.com │  9295 │
│ w3.org                   │   665 │
│ wikipedia.org            │     6 │
└──────────────────────────┴───────┘

4 rows in set. Elapsed: 0.001 sec.
```

… and a couple of minutes later:

```
dd4a5ae26651 :) select domain, qnt from stats order by domain

SELECT domain, qnt
FROM stats
ORDER BY domain ASC

Query id: 9a2975ea-b52a-44ec-acae-698bb51f1973

┌─domain───────────────────┬───qnt─┐
│ habr.com                 │ 12947 │
│ levelup.gitconnected.com │  9316 │
│ w3.org                   │   665 │
│ wikipedia.org            │     6 │
└──────────────────────────┴───────┘

4 rows in set. Elapsed: 0.001 sec.
```

## Conclusion

The refreshable materialized view is quite handy and the most interesting feature of this release.
It allows me to build useful ETL pipelines in a very simple way, using just SQL and letting ClickHouse do the routine job for you. Let me know in the comments which of the new features excites you the most!