I previously worked as a data analyst for a web analytics service, which assists websites in collecting and analyzing customer behavior data.
Data is clearly the core asset of a web analytics service, and one of my primary objectives was to monitor data quality.
To make sure everything is okay with the collected data, we need to keep an eye on two things: that no events are missing or duplicated, and that the data stays consistent.
Today, I would like to tell you about my experience tackling this complex task. As a bonus, I will show examples of ClickHouse array functions.
- What is web analytics?
- How to ensure data quality
- Categorical variables
- And the winner is…
- L1 norm calculation
- Alerting & monitoring
Web analytics systems log a lot of information about events on the site, for example, what browser and operating system the customers used, what URLs they visited, how much time they spent on the site, and even what products they added to the cart and purchased. All this data can be used for reporting (to know how many customers visited the site) or analysis (to understand pain points and improve customer experience).
We will use ClickHouse’s anonymized web analytics data. The guide describing how to load it can be found in the ClickHouse documentation.
Let’s look at the data. VisitID is a unique identifier of a session, while the other parameters are characteristics of this session. UserAgent and OS look like numeric variables, but they hold encoded names of browsers and operating systems. It’s much more efficient to store these values as numbers and then decode them at the application level. Such optimizations really matter and can save you terabytes if you’re working with big data.
SELECT
VisitID,
StartDate,
UTCStartTime,
Duration,
PageViews,
StartURLDomain,
IsMobile,
UserAgent,
OS
FROM datasets.visits_v1
FINAL
LIMIT 10
┌─────────────VisitID─┬──StartDate─┬────────UTCStartTime─┬─Duration─┬─PageViews─┬─StartURLDomain─────────┬─IsMobile─┬─UserAgent─┬──OS─┐
│ 6949594573706600954 │ 2014-03-17 │ 2014-03-17 11:38:42 │ 0 │ 1 │ gruzomoy.sumtel.com.ua │ 0 │ 7 │ 2 │
│ 7763399689682887827 │ 2014-03-17 │ 2014-03-17 18:22:20 │ 24 │ 3 │ gruzomoy.sumtel.com.ua │ 0 │ 2 │ 2 │
│ 9153706821504089082 │ 2014-03-17 │ 2014-03-17 09:41:09 │ 415 │ 9 │ gruzomoy.sumtel.com.ua │ 0 │ 7 │ 35 │
│ 5747643029332244007 │ 2014-03-17 │ 2014-03-17 04:46:08 │ 19 │ 1 │ gruzomoy.sumtel.com.ua │ 0 │ 2 │ 238 │
│ 5868920473837897470 │ 2014-03-17 │ 2014-03-17 10:10:31 │ 11 │ 1 │ gruzomoy.sumtel.com.ua │ 0 │ 3 │ 35 │
│ 6587050697748196290 │ 2014-03-17 │ 2014-03-17 09:06:47 │ 18 │ 2 │ gruzomoy.sumtel.com.ua │ 0 │ 120 │ 35 │
│ 8872348705743297525 │ 2014-03-17 │ 2014-03-17 06:40:43 │ 190 │ 6 │ gruzomoy.sumtel.com.ua │ 0 │ 5 │ 238 │
│ 8890846394730359529 │ 2014-03-17 │ 2014-03-17 02:27:19 │ 0 │ 1 │ gruzomoy.sumtel.com.ua │ 0 │ 57 │ 35 │
│ 7429587367586011403 │ 2014-03-17 │ 2014-03-17 01:13:14 │ 0 │ 1 │ gruzomoy.sumtel.com.ua │ 1 │ 1 │ 12 │
│ 5195928066127503662 │ 2014-03-17 │ 2014-03-17 01:43:02 │ 1926 │ 3 │ gruzomoy.sumtel.com.ua │ 0 │ 2 │ 35 │
└─────────────────────┴────────────┴─────────────────────┴──────────┴───────────┴────────────────────────┴──────────┴───────────┴─────┘
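These codes can be decoded at the application level with a simple lookup. Here is a minimal Python sketch; the code-to-name mappings below are made up for illustration and are not the real dictionaries used by the service.
# Hypothetical code-to-name mappings, for illustration only;
# the real dictionaries live on the application side.
USER_AGENTS = {1: 'IE', 2: 'Firefox', 3: 'Opera', 7: 'Chrome'}
OPERATING_SYSTEMS = {2: 'Windows', 35: 'Android', 238: 'iOS'}

def decode_session(row: dict) -> dict:
    # replace numeric codes with human-readable names, keeping unknown codes as-is
    decoded = dict(row)
    decoded['UserAgent'] = USER_AGENTS.get(row['UserAgent'], 'unknown (%d)' % row['UserAgent'])
    decoded['OS'] = OPERATING_SYSTEMS.get(row['OS'], 'unknown (%d)' % row['OS'])
    return decoded

print(decode_session({'VisitID': 6949594573706600954, 'UserAgent': 7, 'OS': 2}))
# {'VisitID': 6949594573706600954, 'UserAgent': 'Chrome', 'OS': 'Windows'}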
You may notice I’ve specified the final modifier after the table name. I did this to ensure that the data was fully merged and we got exactly one row for each session.
In ClickHouse, the CollapsingMergeTree engine is often used because it allows using inserts instead of updates (more details, as usual, in the documentation). Collapsing normally happens in the background during merges at an unknown point in time; by adding the final modifier, we forced this process.
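To build intuition for how collapsing works, here is a toy Python model (a simplification for illustration, not how ClickHouse actually implements it): an update is written as a pair of rows, the old state with Sign = -1 and the new state with Sign = +1, and collapsing keeps only the rows whose signs don't cancel out.
from collections import defaultdict

# Toy model of CollapsingMergeTree: each session update is an insert,
# where the previous state is "cancelled" with Sign = -1
# and the new state is written with Sign = +1.
rows = [
    {'VisitID': 1, 'PageViews': 1, 'Sign': 1},   # first insert
    {'VisitID': 1, 'PageViews': 1, 'Sign': -1},  # cancel the old state...
    {'VisitID': 1, 'PageViews': 3, 'Sign': 1},   # ...and write the updated one
    {'VisitID': 2, 'PageViews': 5, 'Sign': 1},
]

# "Collapsing" keeps the net effect per state: summing Sign removes cancelled rows,
# which is also why sum(Sign) counts sessions correctly without FINAL.
state = defaultdict(int)
for r in rows:
    state[(r['VisitID'], r['PageViews'])] += r['Sign']

collapsed = [{'VisitID': k[0], 'PageViews': k[1]} for k, s in state.items() if s > 0]
print(collapsed)  # [{'VisitID': 1, 'PageViews': 3}, {'VisitID': 2, 'PageViews': 5}]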
We can execute two simple queries to see the difference.
SELECT
uniqExact(VisitID) AS unique_sessions,
sum(Sign) AS number_sessions,
-- number of sessions after collapsing
count() AS rows
FROM datasets.visits_v1
┌─unique_sessions─┬─number_sessions─┬────rows─┐
│ 1676685 │ 1676581 │ 1680609 │
└─────────────────┴─────────────────┴─────────┘
SELECT
uniqExact(VisitID) AS unique_sessions,
sum(Sign) AS number_sessions,
count() AS rows
FROM datasets.visits_v1
FINAL
┌─unique_sessions─┬─number_sessions─┬────rows─┐
│ 1676685 │ 1676721 │ 1676721 │
└─────────────────┴─────────────────┴─────────┘
Using final has its own performance drawbacks, though.
Verifying that there are no missing or duplicate events is pretty straightforward. You can find plenty of methods to detect anomalies in time series data, ranging from naive approaches (for example, checking that the number of events is within ±20% of the previous week) to ML-based approaches with dedicated libraries.
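As an illustration, here is a minimal sketch of the naive check, assuming we already have the event counts for the current and the reference period:
def is_anomalous(current_events: int, previous_events: int, tolerance: float = 0.2) -> bool:
    # naive check: flag an anomaly if the number of events deviates from the
    # reference period (e.g. the same hour a week ago) by more than ±20%
    if previous_events == 0:
        return current_events != 0
    return abs(current_events - previous_events) / previous_events > tolerance

print(is_anomalous(1150, 1000))  # False: +15% is within the threshold
print(is_anomalous(700, 1000))   # True: -30% deviation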
Data consistency is a trickier task. As I mentioned earlier, web analytics services track a lot of information about customers’ behavior on a site. They log hundreds of parameters, and we need to ensure that all these values look valid.
Parameters can be numeric (duration or the number of pages viewed) or categorical (browser or operating system). For numeric values, we can use statistical criteria to verify that the distribution stays the same, for example, a two-sample Kolmogorov-Smirnov test.
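For instance, here is a sketch of such a check for a numeric field like Duration, using SciPy's two-sample Kolmogorov-Smirnov test (the data and the significance threshold below are made up for illustration):
import numpy as np
from scipy import stats

# durations (in seconds) for the previous and the current period;
# synthetic data here, in practice these come from the database
rng = np.random.default_rng(42)
prev_durations = rng.exponential(scale=120, size=5000)
curr_durations = rng.exponential(scale=125, size=5000)

# the two-sample KS test compares the two empirical distributions directly
result = stats.ks_2samp(prev_durations, curr_durations)
print('KS statistic = %.3f, p-value = %.3f' % (result.statistic, result.pvalue))

# a low p-value suggests the Duration distribution has changed
if result.pvalue < 0.01:
    print('Duration distribution looks different from the previous period')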
So after researching best practices, my remaining question was how to monitor consistency for categorical variables, and that is what we will focus on now.
Let’s use browsers as an example. There are 62 unique values for browsers in our data.
SELECT uniqExact(UserAgent) AS unique_browsers
FROM datasets.visits_v1
┌─unique_browsers─┐
│ 62 │
└─────────────────┘
SELECT
UserAgent,
count() AS sessions,
round((100. * sessions) / (
SELECT count()
FROM datasets.visits_v1
FINAL
), 2) AS sessions_share
FROM datasets.visits_v1
FINAL
GROUP BY 1
HAVING sessions_share >= 1
ORDER BY sessions_share DESC
┌─UserAgent─┬─sessions─┬─sessions_share─┐
│ 7 │ 493225 │ 29.42 │
│ 2 │ 236929 │ 14.13 │
│ 3 │ 235439 │ 14.04 │
│ 4 │ 196628 │ 11.73 │
│ 120 │ 154012 │ 9.19 │
│ 50 │ 86381 │ 5.15 │
│ 79 │ 63082 │ 3.76 │
│ 121 │ 50245 │ 3 │
│ 1 │ 48688 │ 2.9 │
│ 42 │ 21040 │ 1.25 │
│ 5 │ 20399 │ 1.22 │
│ 71 │ 19893 │ 1.19 │
└───────────┴──────────┴────────────────┘
We can monitor the shares for each browser separately as numeric variables, but in that case we would be tracking at least 12 time series for a single field, UserAgent. Anyone who has set up at least one alert knows that the fewer variables we monitor, the better: tracking many parameters leads to a lot of false-positive notifications that you have to handle.
So I started to think about a single metric that shows the difference between distributions. The idea is to compare browsers’ shares now (T2) and before (T1). We can choose the previous period depending on granularity, for example, the previous hour for hourly data or the same day of the previous week for daily data.
Let’s look at the following example: suppose that before, the browser shares were 70%, 20%, and 10%, and now they are 60%, 27%, and 13% (the same numbers are used in the code below).
My first idea was to look at a heuristic metric similar to the L1 norm used in machine learning: half of the sum of absolute differences between the shares, d = 1/2 * sum(|share_now - share_before|).
For the example above, this formula gives us 10%. Actually, this metric has a meaning: it shows the minimal share of events in the distribution for which the browser has changed.
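Here is a minimal sketch of this metric for the example above; the same calculation will later be moved into ClickHouse.
import numpy as np

# browser shares before (prev) and now (curr), as in the example above
prev = np.array([0.7, 0.2, 0.1])
curr = np.array([0.6, 0.27, 0.13])

def get_l1_norm(prev, curr):
    # half of the sum of absolute differences between the shares
    return np.abs(prev - curr).sum() / 2

print('L1 norm = %.2f' % get_l1_norm(prev, curr))
# L1 norm = 0.10, i.e. 10% of events changed their browser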
Afterward, I discussed this topic with my boss, who has quite a lot of experience in data science. He advised me to look at the Kullback-Leibler or Jensen-Shannon divergences, since they are a more rigorous way to measure distances between probability distributions.
If you don’t remember these metrics or have never heard of them before, don’t worry; I was in your shoes. So I googled the formulas (this article thoroughly explains the concepts) and calculated the values for our example.
import numpy as np

# browser shares before (prev) and now (curr)
prev = np.array([0.7, 0.2, 0.1])
curr = np.array([0.6, 0.27, 0.13])

def get_kl_divergence(prev, curr):
    # Kullback-Leibler divergence: sum(p * log(p / q))
    kl = prev * np.log(prev / curr)
    return np.sum(kl)

def get_js_divergence(prev, curr):
    # Jensen-Shannon divergence: symmetrised and smoothed KL divergence
    mean = (prev + curr) / 2
    return 0.5 * (get_kl_divergence(prev, mean) + get_kl_divergence(curr, mean))

kl = get_kl_divergence(prev, curr)
js = get_js_divergence(prev, curr)
print('KL divergence = %.4f, JS divergence = %.4f' % (kl, js))
# KL divergence = 0.0216, JS divergence = 0.0055
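If you prefer to rely on a library, the same values can be cross-checked with SciPy (note that scipy.spatial.distance.jensenshannon returns the square root of the JS divergence, so we square it):
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

prev = np.array([0.7, 0.2, 0.1])
curr = np.array([0.6, 0.27, 0.13])

kl = entropy(prev, curr)             # KL divergence between the two distributions
js = jensenshannon(prev, curr) ** 2  # squared JS distance = JS divergence
print('KL divergence = %.4f, JS divergence = %.4f' % (kl, js))
# KL divergence = 0.0216, JS divergence = 0.0055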
As you can see, the distances we’ve calculated differ significantly. So now we have (at least) three ways to calculate the difference between browser shares before and now, and the next question is which way to choose for our monitoring task.
The best way to estimate how the different methods perform is to see how they would behave in real-life situations. To do this, we can simulate anomalies in the data and compare the effects.
There are two common cases of anomalies in data, and we can take the actual browser shares and simulate both of them. For simplicity, I will group all browsers with shares below 5% into the group browser = 0.
WITH browsers AS
(
SELECT
UserAgent,
count() AS raw_sessions,
(100. * count()) / (
SELECT count()
FROM datasets.visits_v1
FINAL
) AS raw_sessions_share
FROM datasets.visits_v1
FINAL
GROUP BY 1
)
SELECT
if(raw_sessions_share >= 5, UserAgent, 0) AS browser,
sum(raw_sessions) AS sessions,
round(sum(raw_sessions_share), 2) AS sessions_share
FROM browsers
GROUP BY browser
ORDER BY sessions DESC
┌─browser─┬─sessions─┬─sessions_share─┐
│ 7 │ 493225 │ 29.42 │
│ 0 │ 274107 │ 16.35 │
│ 2 │ 236929 │ 14.13 │
│ 3 │ 235439 │ 14.04 │
│ 4 │ 196628 │ 11.73 │
│ 120 │ 154012 │ 9.19 │
│ 50 │ 86381 │ 5.15 │
└─────────┴──────────┴────────────────┘
It’s time to simulate these two scenarios; a minimal sketch of the simulation logic is shown below.
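Here is a sketch of the idea for the data-loss scenario, using the grouped browser shares from the table above: we drop a fraction of one browser's traffic, renormalize the shares, and compare each distance metric with the actual share of affected events.
import numpy as np

# grouped browser shares from the table above (browser 0 combines all small browsers)
shares = np.array([29.42, 16.35, 14.13, 14.04, 11.73, 9.19, 5.15]) / 100
shares = shares / shares.sum()  # normalize away rounding

def kl(p, q):
    return np.sum(p * np.log(p / q))

def js(p, q):
    m = (p + q) / 2
    return 0.5 * (kl(p, m) + kl(q, m))

def l1(p, q):
    return np.abs(p - q).sum() / 2

def simulate_loss(shares, browser_idx, loss):
    # drop `loss` share of one browser's events and renormalize the distribution
    events = shares.copy()
    events[browser_idx] *= 1 - loss
    return events / events.sum()

for loss in (0.1, 0.3, 0.5):
    new_shares = simulate_loss(shares, browser_idx=0, loss=loss)
    affected = shares[0] * loss  # actual share of affected events
    print('affected = %.3f | L1 = %.3f | KL = %.4f | JS = %.4f'
          % (affected, l1(shares, new_shares), kl(shares, new_shares), js(shares, new_shares)))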
As a result of the simulation, we got two graphs that show the correlation between the actual effect and our distance metrics.
Each point in the graph shows the result of one simulation — actual effect and corresponding distance.
You can easily see that the L1 norm is the best metric for our task because it’s the closest to the line distance = share of affected events. The Kullback-Leibler and Jensen-Shannon divergences vary a lot and sit at different levels depending on the case (which browser is losing traffic).
Such metrics don’t suit monitoring because you won’t be able to specify a threshold that alerts you when more than 5% of traffic has been affected. Also, we can’t easily interpret these metrics, while the L1 norm shows exactly the extent of the anomaly.
Now we know which metric will show us the consistency of the data, and the last task left is to implement the L1 norm calculation in the database.
We can use the widely known window functions for it.
with browsers as (
select
UserAgent as param,
multiIf(
toStartOfHour(UTCStartTime) = '2014-03-18 12:00:00', 'previous',
toStartOfHour(UTCStartTime) = '2014-03-18 13:00:00', 'current',
'other'
) as event_time,
sum(Sign) as events
from datasets.visits_v1
where (StartDate = '2014-03-18')
-- filter by partition key is a good practice
and (event_time != 'other')
group by param, event_time)
select
sum(abs_diff)/2 as l1_norm
from
(select
param,
sumIf(share, event_time = 'current') as curr_share,
sumIf(share, event_time = 'previous') as prev_share,
abs(curr_share - prev_share) as abs_diff
from
(select
param,
event_time,
events,
sum(events) over (partition by event_time) as total_events,
events/total_events as share
from browsers)
group by param)
┌─────────────l1_norm─┐
│ 0.01515028932687386 │
└─────────────────────┘
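For comparison, the same L1 norm calculation can be sketched on the Python side with pandas, assuming we have already fetched (param, event_time, events) rows from ClickHouse (the numbers below are made up for illustration):
import pandas as pd

# (param, event_time, events) rows as they would come from the query above;
# the values are made up for illustration
df = pd.DataFrame({
    'param':      ['7', '2', '3', '4', '7', '2', '3', '42'],
    'event_time': ['previous'] * 4 + ['current'] * 4,
    'events':     [700, 200, 80, 20, 600, 270, 110, 20],
})

# one row per param; params missing in a period get 0 events
pivot = df.pivot_table(index='param', columns='event_time',
                       values='events', aggfunc='sum', fill_value=0)

# convert event counts to shares within each period
pivot_shares = pivot / pivot.sum()

# L1 norm: half of the sum of absolute differences between the shares
l1_norm = (pivot_shares['current'] - pivot_shares['previous']).abs().sum() / 2
print(round(l1_norm, 4))  # 0.12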
ClickHouse has very powerful array functions, which I had been using for ages before window functions were supported, so I would like to show you the power of this tool.
with browsers as (
select
UserAgent as param,
multiIf(
toStartOfHour(UTCStartTime) = '2014-03-18 12:00:00', 'previous',
toStartOfHour(UTCStartTime) = '2014-03-18 13:00:00', 'current',
'other'
) as event_time,
sum(Sign) as events
from datasets.visits_v1
where StartDate = '2014-03-18' -- filter by partition key is a good practice
and event_time != 'other'
group by param, event_time
order by event_time, param)
select l1_norm
from
(select
-- aggregating all param values into arrays
groupArrayIf(param, event_time = 'current') as curr_params,
groupArrayIf(param, event_time = 'previous') as prev_params,
-- calculating params that are present in both time periods or only in one of them
arrayIntersect(curr_params, prev_params) as both_params,
arrayFilter(x -> not has(prev_params, x), curr_params) as only_curr_params,
arrayFilter(x -> not has(curr_params, x), prev_params) as only_prev_params,
-- aggregating all events into arrays
groupArrayIf(events, event_time = 'current') as curr_events,
groupArrayIf(events, event_time = 'previous') as prev_events,
-- calculating events shares
arrayMap(x -> x / arraySum(curr_events), curr_events) as curr_events_shares,
arrayMap(x -> x / arraySum(prev_events), prev_events) as prev_events_shares,
-- filtering shares for browsers that are present in both periods
arrayFilter(x, y -> has(both_params, y), curr_events_shares, curr_params) as both_curr_events_shares,
arrayFilter(x, y -> has(both_params, y), prev_events_shares, prev_params) as both_prev_events_shares,
-- filtering shares for browsers that are present only in one of periods
arrayFilter(x, y -> has(only_curr_params, y), curr_events_shares, curr_params) as only_curr_events_shares,
arrayFilter(x, y -> has(only_prev_params, y), prev_events_shares, prev_params) as only_prev_events_shares,
-- calculating the abs differences and l1 norm
arraySum(arrayMap(x, y -> abs(x - y), both_curr_events_shares, both_prev_events_shares)) as both_abs_diff,
1/2*(both_abs_diff + arraySum(only_curr_events_shares) + arraySum(only_prev_events_shares)) as l1_norm
from browsers)
┌─────────────l1_norm─┐
│ 0.01515028932687386 │
└─────────────────────┘
This approach may be convenient for people with a Pythonic mindset. With persistence and creativity, one can write any logic using array functions.
We have two queries that show us the fluctuations of browsers’ shares in our data. You can use this approach to monitor data that you are interested in.
The only bit left is to align with your team on alerting thresholds. I usually look at historical data and previous anomalies to pick some initial levels and then keep adjusting them based on new information: false-positive alerts or missed anomalies.
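A minimal sketch of what the resulting check could look like; the threshold here is just a starting point to be tuned on your historical data, and the notification call is a placeholder.
L1_NORM_THRESHOLD = 0.05  # alert if more than ~5% of events changed their category

def check_consistency(l1_norm: float, field: str) -> None:
    # in real life the alert would go to Slack, PagerDuty, email, etc.
    if l1_norm > L1_NORM_THRESHOLD:
        print('ALERT: %s distribution changed, L1 norm = %.1f%%' % (field, 100 * l1_norm))
    else:
        print('OK: %s distribution is stable, L1 norm = %.1f%%' % (field, 100 * l1_norm))

check_consistency(0.0152, 'UserAgent')  # OK (the value from the queries above)
check_consistency(0.08, 'UserAgent')    # ALERT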
Also, during the implementation of monitoring, I’ve faced a couple of nuances I would like to cover briefly:
- This approach doesn’t make sense for every field: there’s no point in monitoring the distribution of values like UserID or StartDate, so choose what parameters to include wisely.
- Some fields have very high cardinality: StartURL in web analytics data has more than 600K unique values, and calculating metrics for it can be resource-consuming. So I would suggest either bucketing these values (for example, taking the domain or TLD) or monitoring only the top values and grouping the others into a separate group “other” (a sketch of such bucketing follows below).
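Here is a minimal sketch of such bucketing on the Python side; the same thing can be done directly in SQL, as in the browser = 0 grouping above.
import pandas as pd

def bucket_top_values(series: pd.Series, top_n: int = 20, other_label: str = 'other') -> pd.Series:
    # keep only the top_n most frequent values and put everything else into 'other'
    top_values = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top_values), other_label)

# example: bucket URL domains before calculating the L1 norm
urls = pd.Series(['a.com', 'a.com', 'b.com', 'c.com', 'd.com', 'a.com', 'b.com'])
print(bucket_top_values(urls, top_n=2).value_counts())
# a.com    3
# b.com    2
# other    2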
I hope this article was helpful to you, and now you have an idea of how to monitor data quality or at least how to write complex ClickHouse queries using array functions. Stay tuned!