Motivation There’s nothing more important than having a good level of observability for the service and it’s not possible without reliable and fast storage for the application’s logs. One of the most popular solutions nowadays is ELK ( lasticSearch- ogstash- ibana), but it is not as universal as it may seem. E L K Problem statement ELK is a flexible, convenient, and complex solution to collect and analyze logs. It is scalable, robust, flexible query filtering, and universal. However, there are disadvantages to using ELK: high memory and CPU resource consumption Index and search speed degradation along with increasing number of records full-text search index overhead complicated QueryDSL All of these make me wonder: is there an alternative to ELK when we talk about logs? Here is my list of requirements for the log-handling solution: fast and flexible aggregation fast insertion, fast select easy-to-setup and resource-minimal grafana-compatible solution However, it's nice to have a flexible and fast tool to do regular and ad-hoc analysis. Let's be more specific about the system that I’m going to implement: Process 20M+ of log lines which are Nginx access log Get the number of bad HTTP responses (4xx, 5xx) Get bot and crawler counters and the URL they visit Get top-5 IP addresses that are suspicious bots What is ClickHouse and why did I decide to use it ClickHouse is a high-performance column-oriented SQL database management system for online analytical processing as it’s declared. The most important features for me were: Column-oriented storage means that ClickHouse will read from disk only if needed. ClickHouse can leverage all available resources (CPU cores and disks) to execute even a single query. Data compression SQL Support, last but not least Technical Solution Github repo tutorial You can find the tutorial along with a log generator on . GitHub Setup ClickHouse and connect First of all, let’s create docker-compose.yml to define both services. ( ) https://github.com/bp72/nginxlogprocessor/blob/init-commit/docker-compose.yml version: '3.6' services: ch: image: clickhouse/clickhouse-server container_name: clickhouse restart: always volumes: - clickhousedata:/var/lib/clickhouse/ ports: - '8123:8123' - '9000:9000' ulimits: memlock: soft: -1 hard: -1 nofile: soft: 262144 hard: 262144 # This capabilities prevents Docker from complaining about lack of those cap_add: - SYS_NICE - NET_ADMIN - IPC_LOCK volumes: clickhousedata: Let’s run it and check if everything is working: connect to the ClickHouse instance using ClickHouse client from the docker container and check if available via localhost and port > docker-compose up -d [+] Running 3/3 ⠿ Network nginxlogprocessor_default Created 0.1s ⠿ Container clickhouse Started 0.6s To connect to the instance I prefer ClickHouse client docker-compose exec ch clickhouse-client ClickHouse client version 23.9.1.1854 (official build). Connecting to localhost:9000 as user default. Connected to ClickHouse server version 23.9.1 revision 54466. a8c8da069d94 :) Create databases and tables Now, when everything is set, let’s create databases and tables: database to track processed and database for specific service, e.g. , and first table . logs nginx nginx.access One of the significant advantages of ClickHouse is SQL syntax for definitions and queries. It’s not strictly SQL-standard, but very close to it. CREATE DATABASE IF NOT EXISTS nginx CREATE DATABASE IF NOT EXISTS logs CREATE TABLE IF NOT EXISTS nginx.access ( reqid String, ts DateTime64(3), level Enum(''=0, 'debug'=1, 'info'=2, 'warn'=3 ,'error'=4), domain String, uri String, ua String, ref String, is_bot Boolean, is_mobile Boolean, is_tablet Boolean, is_pc Boolean, client String, duration Float32, response_code UInt16, addrIPv4 Nullable(IPv4), addrIPv6 Nullable(IPv6), upstream_connect_time Float32, upstream_header_time Float32, upstream_response_time Float32 ) ENGINE MergeTree PRIMARY KEY reqid ORDER BY reqid CREATE TABLE IF NOT EXISTS logs.logfiles ( filename String ) ENGINE MergeTree PRIMARY KEY filename ORDER BY filename Closer look at statement, you can see slightly different types and completely new types like Enum and IPv4. ClickHouse tries to reduce resource usage and to do that he optimizes it with cool features like Enum. It’s basically a key-value mapping of a string to an int of 8bit or 16bit, which automatically converts on insertion it converts string value to int, and on selection converts the opposite way ( ). CREATE TABLE link IPv4, IPv6 are the special types to store addresses in the most optimal way (as unsigned int) and represent those in a human-readable way, so basically at the insert time you provide a string representation of IP-addr and ClickHouse does everything for you: stores it as int and server unpacked on selection. Log Insertion ClickHouse ideology is to insert fast. In order to do that, ClickHouse handles batch inserts better than one by one. So the insertion script is not very complicated. function yields data chunks of 50k records at max for the ClickHouse client to insert. Each chunk item represents the list of values relative to column names in the list readFile cols # it's not an actual code. # the working implementation you can find at https://github.com/bp72/nginxlogprocessor import clickhouse_connect from config import CLICKHOUSE_HOST, CLICKHOUSE_PORT from log import log client = clickhouse_connect.get_client(host=CLICKHOUSE_HOST, port=CLICKHOUSE_PORT) def loadToClickHouse(client, chunk): cols = [ 'reqid', 'ts', 'level', 'domain', 'uri', 'ua', 'ref', 'is_bot', 'is_mobile', 'is_tablet', 'is_pc', 'client', 'duration', 'response_code', 'addrIPv4', 'addrIPv6', 'upstream_connect_time', 'upstream_header_time', 'upstream_response_time', ] client.insert('nginx.access', chunk, column_names=cols) def processFeed(feed, client, chunk_size=10_000): total = 0 for chunk in readFile(feed, chunk_size=chunk_size): total += len(chunk) loadToClickHouse(client, chunk=chunk) log.info(f'process {feed=} inserted={len(chunk)} {total=}') Actual execution and timing I have on my PC, you can see that parsing and insertion of the 800k records file took 21s of Python execution time. Not bad! > .venv/bin/python ./main.py I:2023-10-15 12:44:02 [18764] f=transport.py:1893 Connected (version 2.0, client OpenSSH_8.9p1) I:2023-10-15 12:44:02 [18764] f=transport.py:1893 Authentication (publickey) successful! I:2023-10-15 12:44:02 [18764] f=fetcher.py:14 connect host='*.*.*.*' port=22 user='root' password=None I:2023-10-15 12:44:02 [18764] f=fetcher.py:18 run cmd='ls /var/log/nginx/*access*.log-*' I:2023-10-15 12:44:02 [18764] f=fetcher.py:34 download src=/var/log/nginx/access.log-2023100812.gz dst=/tmp/access.log-2023100812.gz I:2023-10-15 12:44:07 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=50000 I:2023-10-15 12:44:08 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=100000 I:2023-10-15 12:44:10 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=150000 I:2023-10-15 12:44:11 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=200000 I:2023-10-15 12:44:13 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=250000 I:2023-10-15 12:44:14 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=300000 I:2023-10-15 12:44:15 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=350000 I:2023-10-15 12:44:17 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=400000 I:2023-10-15 12:44:18 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=450000 I:2023-10-15 12:44:20 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=500000 I:2023-10-15 12:44:21 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=550000 I:2023-10-15 12:44:23 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=600000 I:2023-10-15 12:44:24 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=650000 I:2023-10-15 12:44:25 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=700000 I:2023-10-15 12:44:27 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=750000 I:2023-10-15 12:44:28 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=800000 I:2023-10-15 12:44:28 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=2190 total=802190 I:2023-10-15 12:44:28 [18764] f=fetcher.py:34 download src=/var/log/nginx/access.log-2023100814.gz dst=/tmp/access.log-2023100814.gz I:2023-10-15 12:44:31 [18764] f=main.py:20 process feed='/tmp/access.log-2023100814.gz' inserted=50000 total=50000 I:2023-10-15 12:44:32 [18764] f=main.py:20 process feed='/tmp/access.log-2023100814.gz' inserted=50000 total=100000 I:2023-10-15 12:44:33 [18764] f=main.py:20 process feed='/tmp/access.log-2023100814.gz' inserted=30067 total=130067 Log analysis and problem detection ClickHouse uses SQL to query the DB which is very comfortable for most software engineers and intuitively simple. Let’s start with checking the number of records we have, it’s 22M. a8c8da069d94 :) select count(1) from nginx.access; SELECT count(1) FROM nginx.access Query id: f94881f3-2a7d-4039-9646-a6f614adb46c ┌──count()─┐ │ 22863822 │ └──────────┘ It’s easy to have queries with different breakdowns, which might be useful for problem detection and solution, for example, I would like to know from which IP address the host is being scanned for vulnerability. This query demonstrates how flexible data queries can be compared to the ELK. WITH .. AS statement and IN / NOT IN, subquery, aggregation, and filtration make the ClickHouse very convenient. a8c8da069d94 :) with baduri as (select uri, count(1) from nginx.access where response_code = 404 and uri not in ('/about/', '/favicon.ico') group by 1 having count(1) > 3 order by 2 desc limit 10) select IPv4NumToStringClassC(addrIPv4), count(1) from nginx.access where uri in (select uri from baduri) and addrIPv4 is not null group by 1 order by 2 desc limit 5 WITH baduri AS ( SELECT uri, count(1) FROM nginx.access WHERE (response_code = 404) AND (uri NOT IN ('/about/', '/favicon.ico')) GROUP BY 1 HAVING count(1) > 3 ORDER BY 2 DESC LIMIT 10 ) SELECT IPv4NumToStringClassC(addrIPv4), count(1) FROM nginx.access WHERE (uri IN ( SELECT uri FROM baduri )) AND (addrIPv4 IS NOT NULL) GROUP BY 1 ORDER BY 2 DESC LIMIT 5 Query id: cf9bea33-212b-4c58-b6af-8e0aaae50b83 ┌─IPv4NumToStringClassC(addrIPv4)─┬─count()─┐ │ 8.219.64.xxx │ 961 │ │ 178.128.220.xxx │ 378 │ │ 103.231.78.xxx │ 338 │ │ 157.245.200.xxx │ 324 │ │ 116.203.28.xxx │ 260 │ └─────────────────────────────────┴─────────┘ 5 rows in set. Elapsed: 0.150 sec. Processed 45.73 million rows, 1.81 GB (303.88 million rows/s., 12.01 GB/s.) Peak memory usage: 307.49 MiB. Let’s get the top 5 most popular uri per domain. This query uses the handy LIMIT x BY <field> function. a8c8da069d94 :) select domain, uri, count(1) from nginx.access where domain in ('example.com', 'nestfromthebest.com', 'az.org') group by 1, 2 order by 1, 3 desc limit 5 by domain SELECT domain, uri, count(1) FROM nginx.access WHERE domain IN ('example.com', 'nestfromthebest.com', 'az.org') GROUP BY 1, 2 ORDER BY 1 ASC, 3 DESC LIMIT 5 BY domain Query id: 2acd328c-ed82-4d36-916b-8f2ecf764a9d ┌─domain──────┬─uri────────────┬─count()─┐ │ az.org │ /about/ │ 382543 │ │ az.org │ /contacts/ │ 42066 │ │ az.org │ /category/id7 │ 2722 │ │ az.org │ /category/id14 │ 2704 │ │ az.org │ /category/id2 │ 2699 │ │ example.com │ /about/ │ 381653 │ │ example.com │ /contacts/ │ 42023 │ │ example.com │ /category/id2 │ 2694 │ │ example.com │ /category/id8 │ 2688 │ │ example.com │ /category/id13 │ 2670 │ └─────────────┴────────────────┴─────────┘ ┌─domain──────────────┬─uri────────────┬─count()─┐ │ nestfromthebest.com │ /about/ │ 383377 │ │ nestfromthebest.com │ /contacts/ │ 42100 │ │ nestfromthebest.com │ /category/id8 │ 2726 │ │ nestfromthebest.com │ /category/id14 │ 2700 │ │ nestfromthebest.com │ /category/id4 │ 2696 │ └─────────────────────┴────────────────┴─────────┘ 15 rows in set. Elapsed: 0.062 sec. Processed 23.97 million rows, 918.43 MB (388.35 million rows/s., 14.88 GB/s.) Peak memory usage: 98.67 MiB. Conclusion ClickHouse is a great tool to store and manipulate specific data like logs at a large scale. It is definitely worth further learning, and understanding, for example, nested data structure, sampling tooling, window functions, and others I hope you enjoyed this small article and was useful for you!