Efficient Data Storage for Rapid Analysis and Visualization

by Aliaksandr Piatrouski, March 30th, 2023

Many applications receive a lot of shaped data from users or from other services. To name a few examples:


  • A trading application receives packages of data about an instrument every minute.

  • A worldwide support-center application receives messages from users that include constrained values (enums) such as country, language, reason, etc.

  • A maps application receives requests to add a restaurant to the map, where a request can specify what type of food the restaurant serves, its working hours, whether it has job openings, etc.


These examples have one thing in common: massive volumes of data. As a product, you usually want to benefit from this data and build some analysis system for it. In this article, I want to share one of the ways this data can be stored and used for analysis.


The Trivial approach

Before going further, let’s understand why we can’t always use a trivial approach. The trivial approach is to simply store all this data and, when you want to build analytics, export it as CSV and feed it to some nice tool.


There might be several problems with this (and many more depending on the tools you’re using):


  • Building a CSV from 10 rows in your DB is fast, but let’s say it’s 100,000 rows or even more! Building a file with that many rows from the DB might take quite some time.


  • Even if we had an unlimited amount of time, we would likely run out of memory (OOM) when working with that amount of data! Not good.


  • The data doesn’t come from a single table! You have to use joins… Oh, my time… Oh, my memory!


In short, simple export tools won’t work for big analytics, so let’s see how we can solve the problem differently. It’s important to note that it doesn’t matter what type of analysis we do, location-based, time-based, or something else: the issues come from the size of the data.

Granular Snapshot

Recall from the introductory examples that the data we receive comes in a shaped form: it contains typed keys and values, and we are interested in some fixed dimension.


Just to give an idea of the kind of analysis we can make with shaped data, here are a couple of examples:


  • Timeframe-based analysis - when we analyze how the data changes within a given timeframe.
  • Location-based analysis - when we analyze how the data changes within given location borders.
  • Count-based analysis - when we analyze how many records satisfy a given filter.


I will use time-based analysis in this piece, but note that the approach extends easily to the others.


So we start with some basic observations:


  1. The charts that people see don’t have many points; screen resolution alone limits how many are useful.


  2. A timeframe can be built from smaller timeframes; for example, one hour can be represented as 30 minutes plus 30 minutes.




Let’s understand how we can leverage this knowledge. For simplicity, we will work with support center user messages.


Let’s divide the whole timeline into buckets of a given granularity. We will get something similar to the picture below. Then, when a message comes in, we can easily tell which granularity buckets it falls into.


In the example below, a message arrives at 3:24. It goes into the 4th 1H-bucket, the 2nd 3H-bucket, and the 1st 6H-bucket.


Figure: Timeline with granularities and a message event



Let’s build a simplified code model for this in C#.


namespace SupportCenter
{
  public enum Country
  {
    USA,
    Canada,
    UK,
    // ...
  }
  
  public enum Platform
  {
    Web,
    iOS,
    Android,
    // ...
  }

  // Other relevant enums can go here

  public class UserMessage
  {
    public int Id { get; set; }
    public Country Country { get; set; }
    public Platform Platform { get; set; }
    public DateTime TimeOfSubmission { get; set; }
    public string UserData { get; set; }

        
    public DateTime ThirtyMinutesGranularity { get; set; }
    public DateTime HourGranularity { get; set; }
    // Other granularity fields can go here

    public UserMessage(
      Country country,
      Platform platform,
      DateTime timeOfSubmission,
      string userData)
    {
      Country = country;
      Platform = platform;
      TimeOfSubmission = timeOfSubmission;
      UserData = userData;

      ThirtyMinutesGranularity = GetGranularityStartTime(timeOfSubmission, TimeSpan.FromMinutes(30));
      HourGranularity = GetGranularityStartTime(timeOfSubmission, TimeSpan.FromHours(1));
      // ...
    }

    private DateTime GetGranularityStartTime(DateTime timeOfSubmission, TimeSpan granularity)
    {
        // Truncate the submission time down to the start of its granularity bucket
        long granularityTicks = granularity.Ticks;
        long submissionTimeTicks = timeOfSubmission.Ticks;
        return new DateTime(submissionTimeTicks - submissionTimeTicks % granularityTicks, DateTimeKind.Utc);
    }
  }
}


Amazing, now we know which granularity windows each message belongs to (for simplicity, we identify each granularity bucket by its start time rather than by its number).
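
To make this concrete, here is a quick usage sketch for the 3:24 example above (the date and message contents are arbitrary placeholders):


var message = new UserMessage(
  Country.USA,
  Platform.iOS,
  new DateTime(2023, 3, 30, 3, 24, 0, DateTimeKind.Utc),
  "The app crashes on startup");

// message.ThirtyMinutesGranularity -> 2023-03-30 03:00 (start of its 0.5H bucket)
// message.HourGranularity          -> 2023-03-30 03:00 (start of its 1H bucket)
// A 3H bucket would start at 03:00, and a 6H bucket at 00:00.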


But how do we understand which granularity to use?


Let’s use the first observation: we don’t need many data points on the chart. So, for any timeframe we want to display for visual analysis, we don’t need more than 500 points, but we also don’t want fewer than 100. We also need to agree on the minimal granularity; in this article, we will use 0.5H (30 minutes), matching the smallest granularity stored in the model above.


In the table below, I identify which granularity to use when a user is looking at the chart over a given timeframe.


The rule is simple. A granularity g should serve timeframes between roughly g × 100 and g × 500, so that the chart ends up with 100 to 500 points. Starting from the smallest granularity, 0.5H covers timeframes from about 0.5H × 100 ≈ 2 days up to 0.5H × 500 ≈ 10 days; once the timeframe exceeds that maximum, you switch to the next, larger granularity that satisfies the formula.

Displayed granularity | Min timeframe | Max timeframe
0.5H                  | 2D            | 10D
2H                    | 10D           | 1M
1W                    | 4M            | 1Y

Let’s write a function that returns the granularity for a given timeframe.


// 'from' and 'to' are Unix timestamps in seconds
private const int SecondsInDay = 24 * 60 * 60;
private const int SecondsInMonth = 30 * SecondsInDay;

public static TimeSpan GetGranularityForTimeFrame(int from, int to)
{
  var duration = to - from;
  if (duration < 10 * SecondsInDay)
  {
    return TimeSpan.FromMinutes(30);
  }
  if (duration < SecondsInMonth)
  {
    return TimeSpan.FromHours(2);
  }
  // And so on for the larger granularities from the table
  return TimeSpan.FromDays(7);
}


We also need to find the first and last granularity blocks that cover from and to. That is easy to do with the GetGranularityStartTime we defined earlier.
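
A minimal sketch of that step, assuming from and to are Unix timestamps in seconds (as in GetGranularityForTimeFrame above) and that GetGranularityStartTime has been pulled out into a shared static helper:


public static (DateTime FirstBlock, DateTime LastBlock) GetCoveringBlocks(int from, int to, TimeSpan granularity)
{
  // Convert the Unix timestamps to UTC DateTimes, then round each end
  // of the timeframe down to the start of its granularity bucket.
  var fromTime = DateTimeOffset.FromUnixTimeSeconds(from).UtcDateTime;
  var toTime = DateTimeOffset.FromUnixTimeSeconds(to).UtcDateTime;

  return (
    GetGranularityStartTime(fromTime, granularity),
    GetGranularityStartTime(toTime, granularity));
}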


So what do we do with all this?


Knowing all this, we can now create a new table in the database (HitsTable) that stores the number of hits for each (granularity_block, granularity) pair. A hit is recorded whenever a message is created.
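
Here is a minimal sketch of the write path; HitRecord and IncrementHits are hypothetical names for an upsert-style "increment or insert" operation on the hits table:


public class HitRecord
{
  public DateTime GranularityBlock { get; set; } // start time of the bucket
  public TimeSpan Granularity { get; set; }      // 0.5H, 1H, ...
  public long Hits { get; set; }                 // number of messages in this bucket
}

public static void RecordHits(UserMessage message)
{
  // For every granularity we track, bump the counter of the bucket
  // that the message falls into.
  HitsTable.IncrementHits(message.ThirtyMinutesGranularity, TimeSpan.FromMinutes(30));
  HitsTable.IncrementHits(message.HourGranularity, TimeSpan.FromHours(1));
  // ... and so on for the other granularities
}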


public static int[] GetDataPoints(int from, int to)
{
  var granularity = GetGranularityForTimeFrame(from, to);

  // Round both ends of the requested timeframe down to their bucket starts
  var queryFirstBlock = GetGranularityStartTime(from, granularity);
  var queryLastBlock = GetGranularityStartTime(to, granularity);

  // Some pseudocode here to show that you can use some ORM to query the hits table
  return HitsTable
    .queryWhere(granularity_block >= queryFirstBlock)
    .queryWhere(granularity_block <= queryLastBlock)
    .queryWhere(granularity == granularity)
    .groupBy(granularity_block)
    .value();
}
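
For example, fetching the points for a "last 7 days" chart could look like this (a hypothetical usage, with Unix-second timestamps):


var now = (int)DateTimeOffset.UtcNow.ToUnixTimeSeconds();
var weekAgo = now - 7 * 24 * 60 * 60;

// 7 days falls into the "< 10 days" branch, so the 0.5H granularity is used,
// giving 7 * 48 = 336 points - within the 100-500 target.
var points = GetDataPoints(weekAgo, now);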


Non-fixed dimensions

As a follow-up, it’s pretty simple to extend this to work with more than one dimension of filters. For that, the hits table should also contain the fields you want to allow filtering on, so the key becomes (granularity_block, granularity, country, platform, …). What’s important to know is that this will require you to store a lot more data, since there is now a counter per combination of values.
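
A sketch of what the extended record and a filtered query might look like, reusing the same pseudo-ORM style as above (the field names are illustrative):


public class HitRecordWithDimensions
{
  public DateTime GranularityBlock { get; set; }
  public TimeSpan Granularity { get; set; }
  public Country Country { get; set; }
  public Platform Platform { get; set; }
  public long Hits { get; set; }
}

// Example: data points for iOS messages from Canada only.
// Rows are still grouped by bucket; hits are summed because several
// (country, platform) combinations now share each bucket.
var dataPoints = HitsTable
  .queryWhere(granularity_block >= queryFirstBlock)
  .queryWhere(granularity_block <= queryLastBlock)
  .queryWhere(granularity == granularity)
  .queryWhere(country == Country.Canada)
  .queryWhere(platform == Platform.iOS)
  .groupBy(granularity_block)
  .sum(hits);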

Summary

In this article, we studied one approach to storing data that avoids running out of RAM when loading data from the DB (we sacrificed some hard-drive space for this), while making it fast to fetch the data points to visualize on a given chart.



Lead image generated with Stable Diffusion.

Prompt: Illustrate an arbitrary database.