Small-scale Logs Processing and Analysis

Written by pzinovkin | Published 2022/01/02
Tech Story Tags: logs | analytics | clickhouse | sql | logging | software-development | web-development | debugging

TLDROccasionally you need to process some HTTP server logs and extract analytical data from them. It can happen that you don’t have an enterprise-grade log processing pipeline available, and it’s good to know how to process and extract data from logs manually, using tools that are already available or easy to bootstrap.via the TL;DR App

Occasionally you need to process some HTTP server logs and extract analytical data from them. It can happen that you don’t have an enterprise-grade log processing pipeline available, and it’s good to know how to process and extract data from logs manually, using tools that are already available or easy to bootstrap.

Before going into details, there are two moments you should be aware of:

Storage Space

Logs are usually heavily compressed and can take a lot of space on your drive after decompression. For example, if you have a 10GiB archive with logs, expect it to take up to 5-7 times more space after decompression. And considering that method of processing can require additional space, I prefer to have up to 7-10 times more free space than compressed logs.

The rule here is simple - the less space you have, the more time and effort it will take to process logs. There is also an option to decompress logs on the fly, but it will be slower than working with uncompressed files.

Storage speed

It is all about I/O when reading a lot of logs - the faster the storage is, the better. If your laptop or computer doesn't have an SSD and you plan to process tens of gigabytes of logs, it can take a lot of time.

To measure how fast your storage is in a Linux environment, you can use the dd command (in my case, it's gdd from coreutils on macOS). These numbers can tell you the best possible speed of processing.

gdd if=/dev/zero bs=1024k of=testfile count=1024 oflag=dsync && \
  gdd if=testfile bs=1024k of=/dev/null count=1024 oflag=dsync && \
  rm testfile

1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.987115 s, 1.1 GB/s // Writing
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.147887 s, 7.3 GB/s // Reading

Log Formats

There are a lot of different log formats in the wild. One of the most widespread is the Common Log Format(CLF) - it is the default log format for Apache and Nginx HTTP servers. It's a plain text fixed format with the position of each field defined by a server's configuration. While this format is compact, it's not easy to read when unfamiliar with it, or the server's configuration differs from defaults. It can also be difficult to read by a machine. Here is an example of the Nginx server configuration and the log it produces:

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';

...

127.0.0.1 - - [19/Nov/2021:14:32:11 +0000] "GET / HTTP/1.1" 200 1664 "-" "curl/7.77.0"
127.0.0.1 - - [19/Nov/2021:14:32:11 +0000] "GET /teapot HTTP/1.1" 418 0 "-" "curl/7.77.0"
127.0.0.1 - - [19/Nov/2021:14:32:11 +0000] "GET /foo HTTP/1.1" 404 153 "-" "curl/7.77.0"

Another format you may encounter is structured logs in JSON format. This format is simple to read by humans and machines. It also can be parsed by most programming languages. Example of the Nginx configuration and logs in JSON:

log_format combined-json escape=json
  '{'
      '"remote_addr":"$remote_addr",'
      '"remote_user":"$remote_user",'
      '"time_local":"$time_local",'
      '"request":"$request",'
      '"status": "$status",'
      '"body_bytes_sent":"$body_bytes_sent",'
      '"http_referrer":"$http_referer",'
      '"http_user_agent":"$http_user_agent"'
  '}';

...

{"remote_addr":"127.0.0.1","remote_user":"","time_local":"19/Nov/2021:14:31:28 +0000","request":"GET / HTTP/1.1","status": "200","body_bytes_sent":"1664","http_referrer":"","http_user_agent":"curl/7.77.0"}
{"remote_addr":"127.0.0.1","remote_user":"","time_local":"19/Nov/2021:14:31:28 +0000","request":"GET /teapot HTTP/1.1","status": "418","body_bytes_sent":"0","http_referrer":"","http_user_agent":"curl/7.77.0"}
{"remote_addr":"127.0.0.1","remote_user":"","time_local":"19/Nov/2021:14:31:28 +0000","request":"GET /foo HTTP/1.1","status": "404","body_bytes_sent":"153","http_referrer":"","http_user_agent":"curl/7.77.0"}

Command-line Tools

When you need to extract and calculate some data from logs, Linux command-line tools are the first thing that comes to mind. These tools (or utilities) are not created specifically to process logs but are very flexible and powerful when combined together. And what is convenient - they are already available. Here are some examples of how tools like grep, awk, and others could be used.

Find how many 405 Method Not Allowed responses are in a log file.

time grep "\" 405 " access.log | wc -l
   11305

________________________________________________________
Executed in   58.56 secs    fish           external
   usr time   57.02 secs    0.27 millis   57.02 secs
   sys time    1.22 secs    1.32 millis    1.22 secs

Notice extra characters around 405 - because CLF lacks structure, the pattern specifies value but also tries to limit where to look for it in a string. Same grep but without limiting returns a different result which is incorrect.

grep "405" access.log | wc -l # Wrong result!
    8824

Count how many of each status code is in a file, sorted by count in descending order. Output also shows another problem with CLF - splitting by space does not always work as expected, and you can get some amount of garbage in output. But for most cases, this should be ok.

time awk '{print $9}' access.log | sort | uniq -c | sort -nr
21510839 404
889766 200
115696 400
14718 304
11305 405
8777 206
3948 416
3146 403
2908 411
1097 HTTP/1.0\" # Looks like log lines contained space in $remote_user field!
  26 /vicidial/admin.php
  26 /scripts/admin.php
  26 /cgi-bin/admin.php
  26 /admin.php
   7 499
   4 412

________________________________________________________
Executed in   24.48 secs    fish           external
   usr time   30.88 secs    0.39 millis   30.88 secs
   sys time    1.99 secs    1.68 millis    1.99 secs

As you can see, a command to select and aggregate data quickly becomes more and more complex. It also becomes difficult to debug, and some of the commands take a long time to complete. Anyway, it’s a good approach for simple one-off queries when working with a small number of logs.

But if you are looking at 10+ GiB of logs and want to run rather complex queries, there is a better approach - convert semi-structured data to well-structured and use a database to run queries.

Database

The database is a great tool to work with data:

  • It works with well-structured data, so no more garbage in results;

  • Most databases use the same old SQL to retrieve the data;

  • The database is pretty easy to set up when in a non-production environment;

  • Queries generally run faster than reading a file from a disk because of how data is stored.

Most developers are familiar with MySQL and PostgreSQL. They are great RDBMS and can be used to run analytical queries with some limitations. It’s just that most relational databases are not really designed to run queries on tens of millions of rows. However, there are databases specially optimized for this scenario - column-oriented DBMS. One good example is of such a database is ClickHouse. Let’s try it.

To load logs to the database they need to be in a structured format that the database can understand. It means that logs in JSON can be directly loaded into the database. CLF logs need to be converted to CSV format. I find that the best way to do that is to use Perl and regular expression:

perl -ne 'print if s/^(.*?) - .*? \[([^:]*?):(.*?) .*?\] "(.*?) (.*?) (HTTP\/.*?)" (.*?) (.*?) "(.*?)" "(.*?)"$/$1,"$2 $3",$4,"$5",$6,$7,$8,"$9","$10"/g' < access.log > access.csv

For the purpose of running some analytical queries on your laptop you don’t need to invest a lot of effort into configuring a database server - just download the server and run it from a command line.

# MacOS x86_64
curl -O 'https://builds.clickhouse.com/master/macos/clickhouse' && chmod a+x ./clickhouse

# Create directory for ClickHouse data files/directories
mkdir data

# Clickhouse will use the current directory to create files
cd data

# Start the server
./../clickhouse server
Processing configuration file 'config.xml'.
There is no file 'config.xml', will use embedded config.
Logging trace to console
2021.12.31 19:49:27.682457 [ 3587768 ] {} <Information> : Starting ClickHouse 21.12.1.8928 with revision 54457, no build id, PID 14099

In the new console connect to the server and create a database and table for our log file.

./clickhouse client --multiline
ClickHouse client version 21.12.1.8928 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 21.12.1 revision 54451.

:) CREATE DATABASE IF NOT EXISTS logs;

CREATE DATABASE IF NOT EXISTS logs

Query id: 3c376c00-1969-41df-af0c-099174e2e32b

Ok.

0 rows in set. Elapsed: 0.002 sec.

:) CREATE TABLE IF NOT EXISTS logs.access
:-] (
:-]     remote_addr String,
:-]     time_local DateTime,
:-]     request_method String,
:-]     request_uri String,
:-]     http_version String,
:-]     status Int32,
:-]     body_bytes_sent Int64,
:-]     http_referrer String,
:-]     http_user_agent String
:-] )
:-] ENGINE = MergeTree()
:-] ORDER BY tuple();

CREATE TABLE IF NOT EXISTS logs.access
(
    `remote_addr` String,
    `time_local` DateTime,
    `request_method` String,
    `request_uri` String,
    `http_version` String,
    `status` Int32,
    `body_bytes_sent` Int64,
    `http_referrer` String,
    `http_user_agent` String
)
ENGINE = MergeTree
ORDER BY tuple()

Query id: 501fa6b3-a16e-4c97-bb40-2d7d58b997ba

Ok.

0 rows in set. Elapsed: 0.005 sec.

:) ^C
:) Bye.

Now let’s load our data and run some queries.

# Load data to logs.access table from csv file
./clickhouse client --date_time_input_format='best_effort' --query="INSERT INTO logs.access FORMAT CSV" < access.csv

Find how many 405 Method Not Allowed responses there are

:) SELECT count() FROM logs.access WHERE status=405;

SELECT count()
FROM logs.access
WHERE status = 405

Query id: 76b40738-354c-4ad0-a815-6bf9e211fb54

┌─count()─┐
│   11305 │
└─────────┘

1 rows in set. Elapsed: 0.037 sec. Processed 22.56 million rows, 90.25 MB (607.73 million rows/s., 2.43 GB/s.)

Count how many of each status code there are, sorted by count in descending order.

:) SELECT count(), status FROM logs.access GROUP BY status ORDER BY count() DESC;

SELECT
    count(),
    status
FROM logs.access
GROUP BY status
ORDER BY count() DESC

Query id: f9d01465-54ca-4e73-a961-be2f695ded91

┌──count()─┬─status─┐
│ 21512196 │    404 │
│   889766 │    200 │
│   115800 │    400 │
│    14718 │    304 │
│    11305 │    405 │
│     8777 │    206 │
│     3948 │    416 │
│     3146 │    403 │
│     2908 │    411 │
│        7 │    499 │
│        4 │    412 │
└──────────┴────────┘

11 rows in set. Elapsed: 0.045 sec. Processed 22.56 million rows, 90.25 MB (502.02 million rows/s., 2.01 GB/s.)

Now you can play with data without waiting endlessly for each query to complete. Enjoy!


Written by pzinovkin | ...
Published by HackerNoon on 2022/01/02