ClickHouse+Python+Nginx: A quick tutorial on how to handle your logs

by Pavel Bityukov, October 20th, 2023

Motivation

There’s nothing more important than having good observability for a service, and that’s not possible without reliable, fast storage for the application’s logs. One of the most popular solutions nowadays is ELK (Elasticsearch-Logstash-Kibana), but it is not as universal as it may seem.

Problem statement

ELK is a flexible, convenient, and comprehensive solution for collecting and analyzing logs. It is scalable, robust, and universal, with flexible query filtering. However, there are disadvantages to using ELK:

  • high memory and CPU consumption
  • index and search speed degradation as the number of records grows
  • full-text search index overhead
  • complicated QueryDSL


All of these drawbacks make me wonder: is there an alternative to ELK when we talk about logs?

Here is my list of requirements for the log-handling solution:

  • fast and flexible aggregation
  • fast inserts and fast selects
  • easy to set up and minimal resource usage
  • Grafana-compatible


On top of that, it's nice to have a flexible and fast tool for both regular and ad-hoc analysis. Let's be more specific about the system I’m going to implement:

  • Process 20M+ Nginx access log lines
  • Get the number of bad HTTP responses (4xx, 5xx)
  • Get bot and crawler counters and the URLs they visit
  • Get the top 5 IP addresses that look like suspicious bots

What is ClickHouse and why I decided to use it

ClickHouse is, as it declares itself, a high-performance, column-oriented SQL database management system for online analytical processing (OLAP).

The most important features for me were:

  • Column-oriented storage means ClickHouse reads from disk only the columns a query actually needs.
  • ClickHouse can leverage all available resources (CPU cores and disks) to execute even a single query.
  • Data compression
  • SQL support, last but not least

Technical Solution

GitHub repo with the tutorial

You can find the tutorial along with a log generator on GitHub.

Set up ClickHouse and connect

First of all, let’s create a docker-compose.yml to define the services (https://github.com/bp72/nginxlogprocessor/blob/init-commit/docker-compose.yml).

version: '3.6'

services:
  ch:
    image: clickhouse/clickhouse-server
    container_name: clickhouse
    restart: always
    volumes:
    - clickhousedata:/var/lib/clickhouse/
    ports:
    - '8123:8123'
    - '9000:9000'
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 262144
        hard: 262144
    # These capabilities prevent Docker from complaining about their absence
    cap_add:
      - SYS_NICE
      - NET_ADMIN
      - IPC_LOCK

volumes:
  clickhousedata:
 

Let’s run it and check that everything works: connect to the ClickHouse instance using the ClickHouse client inside the Docker container, and check that it is reachable via localhost on the exposed ports (a quick HTTP check follows the output below).

> docker-compose up -d
[+] Running 3/3
 ⠿ Network nginxlogprocessor_default  Created     0.1s
 ⠿ Container clickhouse               Started     0.6s
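
Besides the native client, you can also sanity-check the HTTP interface on the port exposed in docker-compose.yml. A minimal sketch in Python (ClickHouse answers Ok. to GET /ping):

# Minimal sketch: check that the ClickHouse HTTP interface is reachable
# on localhost:8123 (the port exposed in docker-compose.yml).
import urllib.request

with urllib.request.urlopen('http://localhost:8123/ping') as resp:
    # ClickHouse replies with HTTP 200 and the body "Ok."
    print(resp.status, resp.read().decode().strip())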

To connect to the instance, I prefer the ClickHouse client:

docker-compose exec ch clickhouse-client
ClickHouse client version 23.9.1.1854 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 23.9.1 revision 54466.

a8c8da069d94 :)

Create databases and tables

Now that everything is set up, let’s create the databases and tables: a database to track processed log files and a database for the specific service, e.g. nginx, with its first table nginx.access.


One of the significant advantages of ClickHouse is its SQL syntax for definitions and queries. It’s not strictly SQL-standard, but very close to it.

CREATE DATABASE IF NOT EXISTS nginx;
CREATE DATABASE IF NOT EXISTS logs;

CREATE TABLE IF NOT EXISTS nginx.access (
  reqid String,
  ts DateTime64(3),               
  level Enum(''=0, 'debug'=1, 'info'=2, 'warn'=3, 'error'=4),
  domain String,
  uri String,
  ua String,                          
  ref String,
  is_bot Boolean,
  is_mobile Boolean,
  is_tablet Boolean,
  is_pc Boolean,
  client String,
  duration Float32,
  response_code UInt16,
  addrIPv4 Nullable(IPv4),
  addrIPv6 Nullable(IPv6),
  upstream_connect_time Float32,
  upstream_header_time Float32,
  upstream_response_time Float32
)
ENGINE MergeTree
PRIMARY KEY reqid
ORDER BY reqid;

CREATE TABLE IF NOT EXISTS logs.logfiles (
    filename String
)
ENGINE MergeTree
PRIMARY KEY filename
ORDER BY filename;

Taking a closer look at the CREATE TABLE statement, you can see slightly different types and completely new ones like Enum and IPv4. ClickHouse tries to reduce resource usage, and to do that it optimizes storage with features like Enum. It’s basically a key-value mapping of a string to an 8-bit or 16-bit integer: on insertion the string value is automatically converted to an int, and on selection it is converted back the other way (link).


IPv4 and IPv6 are special types that store addresses in the most optimal way (as unsigned integers) while representing them in a human-readable form. At insert time you provide the string representation of the IP address, and ClickHouse does everything for you: it stores it as an int and unpacks it back on selection.
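
To see both conversions in action, here is a minimal Python sketch using clickhouse-connect. It assumes the nginx.access table created above and the port mapping from the compose file; the reqid and IP address are made up for illustration.

# Hedged sketch: level and addrIPv4 are passed as plain strings and ClickHouse
# converts them to Enum8 / UInt32 under the hood. Assumes the nginx.access
# table created above; 'req-demo-1' and the IP address are made-up values.
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', port=8123)

client.command(
    "INSERT INTO nginx.access (reqid, ts, level, domain, uri, response_code, addrIPv4) "
    "VALUES ('req-demo-1', now64(3), 'info', 'example.com', '/about/', 200, '203.0.113.7')"
)

# On selection, the values come back in human-readable form again.
print(client.query(
    "SELECT level, toTypeName(level), addrIPv4, toTypeName(addrIPv4) "
    "FROM nginx.access WHERE reqid = 'req-demo-1'"
).result_rows)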

Log Insertion

The ClickHouse philosophy is to insert fast. To achieve that, it handles batch inserts much better than inserting records one by one.

So the insertion script is not very complicated. The readFile function yields data chunks of at most 50k records for the ClickHouse client to insert. Each chunk item is a list of values ordered according to the column names in the cols list (a simplified sketch of readFile follows the snippet below).

# This is not the actual code.
# The working implementation can be found at https://github.com/bp72/nginxlogprocessor

import clickhouse_connect

from config import CLICKHOUSE_HOST, CLICKHOUSE_PORT
from log import log

client = clickhouse_connect.get_client(host=CLICKHOUSE_HOST, port=CLICKHOUSE_PORT)


def loadToClickHouse(client, chunk):
    cols = [
        'reqid',
        'ts',
        'level',
        'domain',
        'uri',
        'ua', 
        'ref',
        'is_bot',
        'is_mobile',
        'is_tablet',
        'is_pc',
        'client',
        'duration',
        'response_code',
        'addrIPv4',
        'addrIPv6',
        'upstream_connect_time',
        'upstream_header_time',
        'upstream_response_time',
    ]
    client.insert('nginx.access', chunk, column_names=cols)


def processFeed(feed, client, chunk_size=10_000):
    total = 0
    for chunk in readFile(feed, chunk_size=chunk_size):    
        total += len(chunk)
        loadToClickHouse(client, chunk=chunk)
        log.info(f'process {feed=} inserted={len(chunk)} {total=}')
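
The actual readFile implementation lives in the repo linked above; a simplified sketch of the chunking pattern it follows might look like this, where parseLine is a hypothetical stand-in for the real Nginx access-log parser:

# Simplified sketch of the chunking generator, not the repo's actual code.
# parseLine is a hypothetical stand-in for the real Nginx access-log parser;
# it returns one row (a list of values ordered as in cols) per log line.
import gzip


def readFile(path, chunk_size=50_000):
    """Yield lists of parsed rows, at most chunk_size rows per list."""
    opener = gzip.open if path.endswith('.gz') else open
    chunk = []
    with opener(path, 'rt') as fd:
        for line in fd:
            chunk.append(parseLine(line))
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
    if chunk:  # flush the tail that didn't fill a whole chunk
        yield chunk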

Here are the actual execution and timing results from my PC. You can see that parsing and inserting the 800k-record file took 21 seconds of Python execution time. Not bad!

> .venv/bin/python ./main.py
I:2023-10-15 12:44:02 [18764] f=transport.py:1893 Connected (version 2.0, client OpenSSH_8.9p1)
I:2023-10-15 12:44:02 [18764] f=transport.py:1893 Authentication (publickey) successful!
I:2023-10-15 12:44:02 [18764] f=fetcher.py:14 connect host='*.*.*.*' port=22 user='root' password=None
I:2023-10-15 12:44:02 [18764] f=fetcher.py:18 run cmd='ls /var/log/nginx/*access*.log-*'
I:2023-10-15 12:44:02 [18764] f=fetcher.py:34 download src=/var/log/nginx/access.log-2023100812.gz dst=/tmp/access.log-2023100812.gz
I:2023-10-15 12:44:07 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=50000
I:2023-10-15 12:44:08 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=100000
I:2023-10-15 12:44:10 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=150000
I:2023-10-15 12:44:11 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=200000
I:2023-10-15 12:44:13 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=250000
I:2023-10-15 12:44:14 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=300000
I:2023-10-15 12:44:15 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=350000
I:2023-10-15 12:44:17 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=400000
I:2023-10-15 12:44:18 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=450000
I:2023-10-15 12:44:20 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=500000
I:2023-10-15 12:44:21 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=550000
I:2023-10-15 12:44:23 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=600000
I:2023-10-15 12:44:24 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=650000
I:2023-10-15 12:44:25 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=700000
I:2023-10-15 12:44:27 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=750000
I:2023-10-15 12:44:28 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=50000 total=800000
I:2023-10-15 12:44:28 [18764] f=main.py:20 process feed='/tmp/access.log-2023100812.gz' inserted=2190 total=802190
I:2023-10-15 12:44:28 [18764] f=fetcher.py:34 download src=/var/log/nginx/access.log-2023100814.gz dst=/tmp/access.log-2023100814.gz
I:2023-10-15 12:44:31 [18764] f=main.py:20 process feed='/tmp/access.log-2023100814.gz' inserted=50000 total=50000
I:2023-10-15 12:44:32 [18764] f=main.py:20 process feed='/tmp/access.log-2023100814.gz' inserted=50000 total=100000
I:2023-10-15 12:44:33 [18764] f=main.py:20 process feed='/tmp/access.log-2023100814.gz' inserted=30067 total=130067

Log analysis and problem detection

ClickHouse uses SQL to query the database, which is comfortable for most software engineers and intuitively simple.

Let’s start by checking the number of records we have: it’s over 22M.

a8c8da069d94 :) select count(1) from nginx.access;

SELECT count(1)
FROM nginx.access

Query id: f94881f3-2a7d-4039-9646-a6f614adb46c

┌──count()─┐
│ 22863822 │
└──────────┘

It’s easy to write queries with different breakdowns, which is useful for detecting and solving problems. For example, I would like to know which IP addresses are scanning the host for vulnerabilities.

The following query demonstrates how flexible queries can be compared to ELK. The WITH ... AS statement, IN / NOT IN, subqueries, aggregation, and filtering make ClickHouse very convenient.

a8c8da069d94 :) with baduri as (select uri, count(1) from nginx.access where response_code = 404 and uri not in ('/about/', '/favicon.ico') group by 1 having count(1) > 3 order by 2 desc limit 10) select IPv4NumToStringClassC(addrIPv4), count(1) from nginx.access where uri in (select uri from baduri) and addrIPv4 is not null group by 1 order by 2 desc limit 5

WITH baduri AS
    (
        SELECT
            uri,
            count(1)
        FROM nginx.access
        WHERE (response_code = 404) AND (uri NOT IN ('/about/', '/favicon.ico'))
        GROUP BY 1
        HAVING count(1) > 3
        ORDER BY 2 DESC
        LIMIT 10
    )
SELECT
    IPv4NumToStringClassC(addrIPv4),
    count(1)
FROM nginx.access
WHERE (uri IN (
    SELECT uri
    FROM baduri
)) AND (addrIPv4 IS NOT NULL)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5

Query id: cf9bea33-212b-4c58-b6af-8e0aaae50b83

┌─IPv4NumToStringClassC(addrIPv4)─┬─count()─┐
│ 8.219.64.xxx                    │     961 │
│ 178.128.220.xxx                 │     378 │
│ 103.231.78.xxx                  │     338 │
│ 157.245.200.xxx                 │     324 │
│ 116.203.28.xxx                  │     260 │
└─────────────────────────────────┴─────────┘

5 rows in set. Elapsed: 0.150 sec. Processed 45.73 million rows, 1.81 GB (303.88 million rows/s., 12.01 GB/s.)
Peak memory usage: 307.49 MiB.

Let’s get the top 5 most popular URIs per domain. This query uses the handy LIMIT n BY <field> clause.

a8c8da069d94 :) select domain, uri, count(1) from nginx.access where domain in ('example.com', 'nestfromthebest.com', 'az.org') group by 1, 2 order by 1, 3 desc limit 5 by domain

SELECT
    domain,
    uri,
    count(1)
FROM nginx.access
WHERE domain IN ('example.com', 'nestfromthebest.com', 'az.org')
GROUP BY
    1,
    2
ORDER BY
    1 ASC,
    3 DESC
LIMIT 5 BY domain

Query id: 2acd328c-ed82-4d36-916b-8f2ecf764a9d

┌─domain──────┬─uri────────────┬─count()─┐
│ az.org      │ /about/        │  382543 │
│ az.org      │ /contacts/     │   42066 │
│ az.org      │ /category/id7  │    2722 │
│ az.org      │ /category/id14 │    2704 │
│ az.org      │ /category/id2  │    2699 │
│ example.com │ /about/        │  381653 │
│ example.com │ /contacts/     │   42023 │
│ example.com │ /category/id2  │    2694 │
│ example.com │ /category/id8  │    2688 │
│ example.com │ /category/id13 │    2670 │
└─────────────┴────────────────┴─────────┘
┌─domain──────────────┬─uri────────────┬─count()─┐
│ nestfromthebest.com │ /about/        │  383377 │
│ nestfromthebest.com │ /contacts/     │   42100 │
│ nestfromthebest.com │ /category/id8  │    2726 │
│ nestfromthebest.com │ /category/id14 │    2700 │
│ nestfromthebest.com │ /category/id4  │    2696 │
└─────────────────────┴────────────────┴─────────┘

15 rows in set. Elapsed: 0.062 sec. Processed 23.97 million rows, 918.43 MB (388.35 million rows/s., 14.88 GB/s.)
Peak memory usage: 98.67 MiB.


Conclusion

ClickHouse is a great tool for storing and manipulating specific data like logs at large scale. It is definitely worth learning further: nested data structures, sampling, window functions, and more.


I hope you enjoyed this small article and that it was useful for you!