Strategies and Best Practices for Ensuring Data Consistency

by Mariia Mansurova (@miptgirl), June 28th, 2023
I previously worked as a data analyst for a web analytics service, which assists websites in collecting and analyzing customer behavior data.

It is evident that data holds immense value for web analytics services. One of my primary objectives was to ensure the monitoring of data quality.

To ensure that everything is okay with the collected data, we need to keep an eye on two things:

  • There are no missing or duplicate events -> the number of events and sessions is within the expected range.
  • Data is correct -> for each parameter, the distribution of values stays the same, and yet another release hasn’t started logging all browsers as Safari or stopped tracking purchases completely.

Today, I would like to tell you about my experience tackling this complex task. As a bonus, I will show examples of ClickHouse array functions.

Content Overview

  • What is web analytics?

  • How to ensure data quality

  • Categorical variables

  • And the winner is…

  • L1 norm calculation

  • Alerting & monitoring

What is web analytics?

Web analytics systems log a lot of information about events on the site, for example, what browser and operating system the customers used, what URLs they visited, how much time they spent on the site, and even what products they added to the cart and purchased. All this data can be used for reporting (to know how many customers visited the site) or analysis (to understand pain points and improve customer experience).

We will use ClickHouse’s anonymized web analytics data. The guide describing how to load it can be found here.

Let’s look at the data. VisitID is a unique identifier of a session, while the other parameters are characteristics of this session. UserAgent and OS look like numeric variables, but they hold encoded names of browsers and operating systems. It’s much more efficient to store these values as numbers and decode them at the application level. Such optimizations really matter and can save you terabytes if you’re working with big data.

SELECT *
FROM datasets.visits_v1 FINAL
LIMIT 10

│ 6949594573706600954 │ 2014-03-17 │ 2014-03-17 11:38:42 │        0 │         1 │ │        0 │         7 │   2 │
│ 7763399689682887827 │ 2014-03-17 │ 2014-03-17 18:22:20 │       24 │         3 │ │        0 │         2 │   2 │
│ 9153706821504089082 │ 2014-03-17 │ 2014-03-17 09:41:09 │      415 │         9 │ │        0 │         7 │  35 │
│ 5747643029332244007 │ 2014-03-17 │ 2014-03-17 04:46:08 │       19 │         1 │ │        0 │         2 │ 238 │
│ 5868920473837897470 │ 2014-03-17 │ 2014-03-17 10:10:31 │       11 │         1 │ │        0 │         3 │  35 │
│ 6587050697748196290 │ 2014-03-17 │ 2014-03-17 09:06:47 │       18 │         2 │ │        0 │       120 │  35 │
│ 8872348705743297525 │ 2014-03-17 │ 2014-03-17 06:40:43 │      190 │         6 │ │        0 │         5 │ 238 │
│ 8890846394730359529 │ 2014-03-17 │ 2014-03-17 02:27:19 │        0 │         1 │ │        0 │        57 │  35 │
│ 7429587367586011403 │ 2014-03-17 │ 2014-03-17 01:13:14 │        0 │         1 │ │        1 │         1 │  12 │
│ 5195928066127503662 │ 2014-03-17 │ 2014-03-17 01:43:02 │     1926 │         3 │ │        0 │         2 │  35 │

You may notice I’ve specified the final modifier after the table name. I did this to ensure that the data was fully merged and we got exactly one row for each session.

In ClickHouse, the CollapsingMergeTree engine is often used because it allows using inserts instead of updates (more details, as usual, in the docs). With this approach, you can have several rows per session after an update, and the system merges them in the background. Using the final modifier, we force this process at query time.
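To make the collapsing mechanics concrete, here is a tiny Python sketch with made-up rows showing why sum(Sign) gives the net number of sessions: an update is written as a cancel row plus a new state row, so the pairs cancel out.

```python
# Hypothetical rows for one session in a CollapsingMergeTree-style table:
# an update is stored as a cancel row (Sign = -1) plus a new state row (Sign = +1).
rows = [
    {"VisitID": 42, "PageViews": 1, "Sign": 1},   # initial insert
    {"VisitID": 42, "PageViews": 1, "Sign": -1},  # cancel of the old state
    {"VisitID": 42, "PageViews": 3, "Sign": 1},   # updated state
]

raw_rows = len(rows)                         # rows before the background merge
net_sessions = sum(r["Sign"] for r in rows)  # sum(Sign) collapses the update pair

print(raw_rows, net_sessions)  # 3 1
```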

We can execute two simple queries to see the difference.

SELECT
    uniqExact(VisitID) AS unique_sessions,
    sum(Sign) AS number_sessions, 
    -- number of sessions after collapsing
    count() AS rows
FROM datasets.visits_v1

│         1676685 │         1676581 │ 1680609 │

SELECT
    uniqExact(VisitID) AS unique_sessions,
    sum(Sign) AS number_sessions,
    count() AS rows
FROM datasets.visits_v1 FINAL

│         1676685 │         1676721 │ 1676721 │

Keep in mind, though, that using final comes at a performance cost.

How to ensure data quality

Verifying that there are no missing or duplicate events is pretty straightforward. You can find a lot of methods to detect anomalies in time series data starting from naive approaches (for example, the number of events is within +20% or -20% compared to the previous week) to ML with libraries like Prophet or PyCaret.
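As an illustration of the naive approach, a minimal check might look like this (the event counts are hypothetical):

```python
def within_expected_range(current, reference, tolerance=0.2):
    """Naive anomaly check: is the current count within +-20% of the
    reference value (e.g., the same day a week ago)?"""
    return abs(current - reference) <= tolerance * reference

# hypothetical daily event counts
print(within_expected_range(1_150_000, 1_000_000))  # True: +15% is within bounds
print(within_expected_range(700_000, 1_000_000))    # False: -30% deviation
```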

Data consistency is a bit trickier. As I mentioned earlier, web analytics services track a lot of information about customers’ behavior on a site. They log hundreds of parameters, and we need to ensure that all these values look valid.

Parameters can be numeric (duration or the number of seen web pages) or categorical (browser or operating system). For numeric values, we can use statistical criteria to ensure that distribution stays the same — for example, the Kolmogorov-Smirnov test.
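For reference, the two-sample Kolmogorov-Smirnov statistic is just the maximum gap between the two empirical CDFs; a minimal NumPy sketch is below (in practice, scipy.stats.ks_2samp also returns a p-value):

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the empirical CDFs of the two samples."""
    a, b = np.sort(sample_a), np.sort(sample_b)
    all_values = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, all_values, side="right") / len(a)
    cdf_b = np.searchsorted(b, all_values, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

# identical samples -> 0; fully separated samples -> 1
print(ks_statistic([1, 2, 3], [1, 2, 3]))     # 0.0
print(ks_statistic([1, 2, 3], [10, 11, 12]))  # 1.0
```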

So after researching best practices, my only question was how to monitor consistency for categorical variables, and it’s time to talk about it.

Categorical variables

Let’s use browsers as an example. There are 62 unique values for browsers in our data.

SELECT uniqExact(UserAgent) AS unique_browsers
FROM datasets.visits_v1

│              62 │

SELECT
    UserAgent,
    count() AS sessions,
    round((100. * sessions) / (
        SELECT count()
        FROM datasets.visits_v1
    ), 2) AS sessions_share
FROM datasets.visits_v1
GROUP BY UserAgent
HAVING sessions_share >= 1
ORDER BY sessions_share DESC

│         7 │   493225 │          29.42 │
│         2 │   236929 │          14.13 │
│         3 │   235439 │          14.04 │
│         4 │   196628 │          11.73 │
│       120 │   154012 │           9.19 │
│        50 │    86381 │           5.15 │
│        79 │    63082 │           3.76 │
│       121 │    50245 │              3 │
│         1 │    48688 │            2.9 │
│        42 │    21040 │           1.25 │
│         5 │    20399 │           1.22 │
│        71 │    19893 │           1.19 │

We could monitor the share of each browser separately as a numeric variable, but in that case, we would be tracking at least 12 time series for a single field, UserAgent. Anyone who has set up at least one alert knows that the fewer metrics we monitor, the better: tracking many parameters means a lot of false positive notifications to handle.

So I started to think about a single metric that would show the difference between distributions. The idea is to compare browsers’ shares now (T2) and before (T1). We can choose the previous period depending on granularity:

  • for minute data — you can just look at the previous point,
  • for daily data — it’s worth looking at the same day a week ago to account for weekly seasonality,
  • for monthly data — you can look at the data from a year ago.

Let’s look at the following example: suppose the previous period’s browser shares were 70% / 20% / 10%, and the current ones are 60% / 27% / 13%.

My first idea was to look at a heuristic metric similar to the L1 norm used in machine learning: sum the absolute differences between the old and new shares and divide the result by two.

For the example above, this formula gives 10%. This metric actually has a meaning: it shows the minimal share of events in the distribution for which the browser has changed.
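This heuristic is a two-liner in NumPy; the shares below are the same hypothetical 70% / 20% / 10% vs. 60% / 27% / 13% distributions:

```python
import numpy as np

prev = np.array([0.7, 0.2, 0.1])    # browser shares in the previous period
curr = np.array([0.6, 0.27, 0.13])  # browser shares now

# L1-norm-style metric: half the sum of absolute share differences
l1_metric = np.abs(prev - curr).sum() / 2
print(round(l1_metric, 4))  # 0.1 -> 10% of events changed their browser
```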

Afterward, I discussed this topic with my boss, who has quite a lot of experience in data science. He advised me to look at Kullback-Leibler or Jensen-Shannon divergences since it’s a more valid approach to calculating distances between probability distributions.

If you don’t remember these metrics or have never heard of them before, don’t worry, I was in your shoes. So I googled the formulas (this article thoroughly explains the concepts) and calculated values for our example.

import numpy as np

prev = np.array([0.7, 0.2, 0.1])
curr = np.array([0.6, 0.27, 0.13])

def get_kl_divergence(prev, curr):
    # Kullback-Leibler divergence: sum(p * log(p / q))
    kl = prev * np.log(prev / curr)
    return np.sum(kl)

def get_js_divergence(prev, curr):
    # Jensen-Shannon divergence: symmetrized KL against the mean distribution
    mean = (prev + curr) / 2
    return 0.5 * (get_kl_divergence(prev, mean) + get_kl_divergence(curr, mean))

kl = get_kl_divergence(prev, curr)
js = get_js_divergence(prev, curr)
print('KL divergence = %.4f, JS divergence = %.4f' % (kl, js))

# KL divergence = 0.0216, JS divergence = 0.0055

As you can see, the distances we’ve calculated differ significantly. So now we have (at least) three ways to calculate the difference between browser shares before and now, and the next question is which way to choose for our monitoring task.

And the winner is…

The best way to estimate the performance of different methods is to see how they would perform in real-life situations. To do this, we can simulate anomalies in the data and compare the effects.

There are two common cases of anomalies in data:

  • Data loss: we start losing data from one of the browsers, and the shares of all other browsers increase.
  • Changes: traffic from one browser starts being marked as another browser; for example, 10% of Safari events are now seen as undefined.

We can take actual browser shares and simulate these anomalies. For simplicity, I will group all browsers with shares below 5% into the group browser = 0.

WITH browsers AS
    (
        SELECT
            UserAgent,
            count() AS raw_sessions,
            (100. * count()) / (
                SELECT count()
                FROM datasets.visits_v1
            ) AS raw_sessions_share
        FROM datasets.visits_v1
        GROUP BY 1
    )
SELECT
    if(raw_sessions_share >= 5, UserAgent, 0) AS browser,
    sum(raw_sessions) AS sessions,
    round(sum(raw_sessions_share), 2) AS sessions_share
FROM browsers
GROUP BY browser
ORDER BY sessions DESC

│       7 │   493225 │          29.42 │
│       0 │   274107 │          16.35 │
│       2 │   236929 │          14.13 │
│       3 │   235439 │          14.04 │
│       4 │   196628 │          11.73 │
│     120 │   154012 │           9.19 │
│      50 │    86381 │           5.15 │

It’s time to simulate these two scenarios. You can find all code on GitHub. The most important parameter for us is the actual effect — the share of events lost or changed. Ideally, we would like our metric to equal this effect.
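A sketch of the data-loss scenario in NumPy is below. The shares are taken from the table above; the simulation logic is my simplified reconstruction, not the exact code from the repository:

```python
import numpy as np

def simulate_data_loss(shares, browser_idx, effect):
    """Drop `effect` share of one browser's traffic and renormalize,
    so the remaining browsers' shares grow proportionally."""
    damaged = shares.copy()
    damaged[browser_idx] *= 1 - effect
    return damaged / damaged.sum()

def l1_distance(prev, curr):
    # half the sum of absolute share differences
    return np.abs(prev - curr).sum() / 2

# browser shares from the table above (browsers 7, 0, 2, 3, 4, 120, 50)
shares = np.array([0.2942, 0.1635, 0.1413, 0.1404, 0.1173, 0.0919, 0.0515])
shares = shares / shares.sum()  # normalize away rounding noise

damaged = simulate_data_loss(shares, 0, 0.10)  # lose 10% of the top browser's events
print(round(l1_distance(shares, damaged), 4))
```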

As a result of the simulation, we got two graphs that show the correlation between the actual effect and our distance metrics.

Each point in the graph shows the result of one simulation — actual effect and corresponding distance.

You can easily see that the L1 norm is the best metric for our task because it is the closest to the line distance = share of affected events. The Kullback-Leibler and Jensen-Shannon divergences vary a lot and sit at different levels depending on the use case (which browser is losing traffic).

Such metrics don’t suit monitoring because you won’t be able to specify a threshold that alerts you when more than 5% of traffic has been affected. Also, we can’t easily interpret these metrics, while the L1 norm shows exactly the extent of the anomaly.

L1 norm calculation

Now we know what metric will show us the consistency of data, and the last task left is to implement L1 norm calculations in the database.

We can use widely-known window functions for it.

with browsers as (
    select
        UserAgent as param,
        multiIf(
            toStartOfHour(UTCStartTime) = '2014-03-18 12:00:00', 'previous',
            toStartOfHour(UTCStartTime) = '2014-03-18 13:00:00', 'current',
            'other'
        ) as event_time,
        sum(Sign) as events
    from datasets.visits_v1
    where (StartDate = '2014-03-18')
    -- filter by partition key is a good practice
        and (event_time != 'other')
    group by param, event_time)
select
    sum(abs_diff)/2 as l1_norm
from
    (select
        param,
        sumIf(share, event_time = 'current') as curr_share,
        sumIf(share, event_time = 'previous') as prev_share,
        abs(curr_share - prev_share) as abs_diff
    from
        (select
            param,
            event_time,
            events,
            sum(events) over (partition by event_time) as total_events,
            events/total_events as share
        from browsers)
    group by param)

│ 0.01515028932687386 │

ClickHouse has very powerful array functions that I used for ages before window functions were supported, so I would like to show you the power of this tool.

with browsers as (
    select
        UserAgent as param,
        multiIf(
            toStartOfHour(UTCStartTime) = '2014-03-18 12:00:00', 'previous',
            toStartOfHour(UTCStartTime) = '2014-03-18 13:00:00', 'current',
            'other'
        ) as event_time,
        sum(Sign) as events
    from datasets.visits_v1
    where StartDate = '2014-03-18' -- filter by partition key is a good practice
        and event_time != 'other'
    group by param, event_time
    order by event_time, param)
select l1_norm
from
    (select
        -- aggregating all param values into arrays
        groupArrayIf(param, event_time = 'current') as curr_params,
        groupArrayIf(param, event_time = 'previous') as prev_params,
        -- calculating params that are present in both time periods or only in one of them
        arrayIntersect(curr_params, prev_params) as both_params,
        arrayFilter(x -> not has(prev_params, x), curr_params) as only_curr_params,
        arrayFilter(x -> not has(curr_params, x), prev_params) as only_prev_params,
        -- aggregating all events into arrays
        groupArrayIf(events, event_time = 'current') as curr_events,
        groupArrayIf(events, event_time = 'previous') as prev_events,
        -- calculating events shares
        arrayMap(x -> x / arraySum(curr_events), curr_events) as curr_events_shares,
        arrayMap(x -> x / arraySum(prev_events), prev_events) as prev_events_shares,
        -- filtering shares for browsers that are present in both periods
        arrayFilter(x, y -> has(both_params, y), curr_events_shares, curr_params) as both_curr_events_shares,
        arrayFilter(x, y -> has(both_params, y), prev_events_shares, prev_params) as both_prev_events_shares,
        -- filtering shares for browsers that are present only in one of periods
        arrayFilter(x, y -> has(only_curr_params, y), curr_events_shares, curr_params) as only_curr_events_shares,
        arrayFilter(x, y -> has(only_prev_params, y), prev_events_shares, prev_params) as only_prev_events_shares,
        -- calculating the abs differences and l1 norm
        arraySum(arrayMap(x, y -> abs(x - y), both_curr_events_shares, both_prev_events_shares)) as both_abs_diff,
        1/2*(both_abs_diff + arraySum(only_curr_events_shares) + arraySum(only_prev_events_shares)) as l1_norm
    from browsers)

│ 0.01515028932687386 │

This approach may be convenient for people with a Pythonic mindset. With persistence and creativity, one can write any logic using array functions.

Alerting & Monitoring

We have two queries that show us the fluctuations of browsers’ shares in our data. You can use this approach to monitor data that you are interested in.

The only bit left is to align with your team on alerting thresholds. I usually look at historical data and previous anomalies to pick initial levels, and then keep adjusting them based on new information: false positive alerts or missed anomalies.

Also, during the implementation of monitoring, I’ve faced a couple of nuances I would like to cover briefly:

  • There are parameters in the data that make no sense to monitor, for example, UserID or StartDate, so choose which parameters to include wisely.
  • You may have parameters with high cardinality. For example, StartURL in web analytics data has more than 600K unique values, and calculating metrics for it can be resource-consuming. I would suggest either bucketing these values (for example, taking the domain or TLD) or monitoring only the top values and grouping the rest into a separate “other” group.
  • You can use the same framework for numeric values using bucketing.
  • There are cases when significant changes in data are expected. For example, if you are monitoring the app version field, you will get alerts after each release. And such events can be helpful to ensure your monitoring is still working :)
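A minimal sketch of the bucketing idea for numeric values (the duration buckets and sample values here are hypothetical):

```python
import numpy as np

def bucket_shares(values, bins):
    """Turn a numeric sample into a categorical distribution via bucketing."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / counts.sum()

def l1_norm(prev_shares, curr_shares):
    # half the sum of absolute share differences, as before
    return np.abs(prev_shares - curr_shares).sum() / 2

# hypothetical session-duration buckets (in seconds)
bins = [0, 10, 60, 300, np.inf]
prev = bucket_shares([5, 12, 45, 200, 400, 8], bins)
curr = bucket_shares([5, 12, 45, 200, 400, 350], bins)
print(round(l1_norm(prev, curr), 4))
```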

I hope this article was helpful to you, and now you have an idea of how to monitor data quality or at least how to write complex ClickHouse queries using array functions. Stay tuned!


The lead image for this article was generated by HackerNoon's AI Image Generator via the prompt "visualize data”.