paint-brush
What is the Significance of Time-Weighted Averages in Data Analysisby@davidkohn
243 reads

What is the Significance of Time-Weighted Averages in Data Analysis

by David Kohn18mAugust 14th, 2021
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data. Time-series data comes at you fast, sometimes generating millions of data points per second. TimescaleDB is a petabyte-scale, completely free relational database for time-series. Using time_weight and other hyperfunctions, you can download and install the timescaledb_toolkit extension on GitHub, after which you’ll be able to use TimewiseDB.

Company Mentioned

Mention Thumbnail
featured image - What is the Significance of Time-Weighted Averages in Data Analysis
David Kohn HackerNoon profile picture

Learn how time-weighted averages are calculated, why they’re so powerful for data analysis, and how to use TimescaleDB hyperfunctions to calculate them faster – all using SQL.


Many people who work with time-series data have nice, regularly sampled datasets. Data could be sampled every few seconds, or milliseconds, or whatever they choose, but by regularly sampled, we mean the time between data points is basically constant.


Computing the average value of data points over a specified time period in a regular dataset is a relatively well-understood query to compose. But for those who don't have regularly sampled data, getting a representative average over a period of time can be a complex and time-consuming query to write.


Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data.


Time-series data comes at you fast, sometimes generating millions of data points per second (read more about time-series data). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built TimescaleDB, a multi-node, petabyte-scale, completely free relational database for time-series.


Irregularly sampled time-series data just adds another level of complexity – and is more common than you may think. For example, irregularly sampled data, and thus the need for time-weighted averages, frequently occurs in:


  • Industrial IoT, where teams “compress” data by only sending points when the value changes;

  • Remote sensing, where sending data back from the edge can be costly, so you only send high-frequency data for the most critical operations;

  • Trigger-based systems, where the sampling rate of one sensor is affected by the reading of another (i.e., a security system that sends data more frequently when a motion sensor is triggered)

  • ...and many, many more


If you prefer to manage your own database instances, you can download and install the timescaledb_toolkit extension on GitHub, after which you’ll be able to use time_weight and other hyperfunctions.



What are time-weighted averages?

I’ve been a developer at Timescale for over 3 years and worked in databases for about 5 years, but I was an electrochemist before that. As an electrochemist, I worked for a battery manufacturer and saw a lot of charts like these:

Example battery discharge curve, which describes how long a battery can power something. (Also a prime example of where time-weighted averages are 💯 necessary) Derived from https://www.nrel.gov/docs/fy17osti/67809.pdf

That’s a battery discharge curve, which describes how long a battery can power something. The x-axis shows capacity in Amp-hours, and since this is a constant current discharge, the x-axis is really just a proxy for time. The y-axis displays voltage, which determines the battery’s power output; as you continue to discharge the battery, the voltage drops until it gets to a point where it needs to be recharged.


When we’d do R&D for new battery formulations, we’d cycle many batteries many times to figure out which formulations make batteries last the longest.


If you look more closely at the discharge curve, you’ll notice that there are only two “interesting” sections:

Example battery discharge curve, calling out the “interesting bits” (the points in time where data changes rapidly)

These are the parts at the beginning and end of the discharge where the voltage changes rapidly. Between these two sections, there’s that long period in the middle, where the voltage hardly changes at all:

Example battery discharge curve, calling out the “boring bits” (the points in time where the data remains fairly constant)

Now, when I said before that I was an electrochemist, I will admit that I was exaggerating a little bit. I knew enough about electrochemistry to be dangerous, but I worked with folks with PhDs who knew a lot more than I did.


But, I was often better than them at working with data, so I’d do things like programming the potentiostat--the piece of equipment you hook the battery up to in order to perform these tests.


For the interesting parts of the discharge cycle (those parts at the start and end), we could have the potentiostat sample at its max rate, usually a point every 10 milliseconds or so. We didn’t want to sample as many data points during the long, boring parts where the voltage didn’t change because it would mean saving lots of data with unchanging values and wasting storage.


To reduce the boring data we’d have to deal with without losing the interesting bits, we’d set up the program to sample every 3 minutes, or when the voltage changed by a reasonable amount, say more than 5 mV.


In practice, what would happen is something like this:

Example battery discharge curve with data points superimposed to depict rapid sampling during the interesting bits and slower sampling during the boring bits.

By sampling the data in this way, we'd get more data during the interesting parts and less data during the boring middle section. That’s great!


It let us answer more interesting questions about the quickly changing parts of the curve and gave us all the information we needed about the slowly changing sections – without storing gobs of redundant data. But, here’s a question: given this dataset, how do we find the average voltage during the discharge?


That question is important because it was one of the things we could compare between this discharge curve and future ones, say 10 or 100 cycles later. As a battery ages, its average voltage drops, and how much it dropped over time could tell us how well the battery’s storage capacity held up during its lifecycle – and if it could turn into a useful product.


The problem is that the data in the interesting bits is sampled more frequently (i.e., there are more data points for the interesting bits), which would give it more weight when calculating the average, even though it shouldn't.

Example battery discharge curve, with illustrative data points to show that while we collect more data during the interesting bits, they shouldn’t count “extra.”

If we just took a naive average over the whole curve, adding the value at each point and dividing by the number of points, it would mean that a change to our sampling rate could change our calculated average...even though the underlying effect was really the same!


We could easily overlook any of the differences we were trying to identify – and any clues about how we could improve the batteries could just get lost in the variation of our sampling protocol.


Now, some people will say: well, why not just sample at max rate of the potentiostat, even during the boring parts? Well, these discharge tests ran really long.


They’d take 10 to 12 hours to complete, but the interesting bits could be pretty short, from seconds or minutes. If we sampled at the highest rate, one every 10ms or so, it would mean orders of magnitude more data to store even though we would hardly use any of it! And orders of magnitude more data would mean more cost, more time for analysis, all sorts of problems.


So the big question is: how do we get a representative average when we’re working with irregularly spaced data points?


Let’s get theoretical for a moment here:


(This next bit is a little equation-heavy, but I think they’re relatively simple equations, and they map very well onto their graphical representation. I always like it when folks give me the math and graphical intuition behind the calculations – but if you want, you can skip ahead to just see how time-weighted average is used.)


Mathy Bits: How to derive a time-weighted average

Let’s say we have some points like this:

A theoretical, irregularly sampled time-series dataset

Then, the normal average would be the sum of the values, divided by the total number of points:

Equation: average equals the sum of the values (v1, v2, v3, v4), divided by four

But, because they’re irregularly spaced, we need some way to account for that.


One way to think about it would be to get a value at every point in time, and then divide it by the total amount of time. This would be like getting the total area under the curve and dividing by the total amount of time ΔT.

The area under an irregularly sampled time-series dataset

Equation: better average equals the total area under the curve divided by the total amount of time delta T

(In this case, we’re doing a linear interpolation between the points). So, let’s focus on finding that area. The area between the first two points is a trapezoid:

A trapezoid representing the area under the first two points


Which is really a rectangle plus a triangle:

That same trapezoid broken down into a rectangle and a triangle. Okay, let’s calculate that area:

Equation: area equals delta t1 times v1 plus delta t1 times delta v1 divided by two

So just to be clear, that’s:

The delta t1 times v1 represents the area of the rectangle. And the delta t1 times delta v1 divided by two represents the area of triangle.

Okay. So now if we notice that:

Equation: deltav1 equals to v2 minus v1.

We can simplify this equation pretty nicely.

Start with:

Equation: delta t1 times v1 plus delta t1 times v2 minus v1 divided by two

Factor out:

delta t1 divided by two Equation: delta t1 divided by two times 2v1 plus the difference between v2 and v1

Simplify:

Equation: delta t1 divided by two times sum of v1 and v2

One cool thing to note is that this gives us a new way to think about this solution: it’s the average of each pair of adjacent values, weighted by the time between them:

Equation: area equals v1 plus v2 divided by two times delta t1

It’s also equal to the area of the rectangle drawn to the midpoint between v1 and v2:

The area of the trapezoid and of the rectangle, drawn to the midpoint between the two points, is the same.

Now that we’ve derived the formula for two adjacent points, we can repeat this for every pair of adjacent points in the dataset. Then all we need to do is sum that up, and that will be the time-weighted sum, which is equal to the area under the curve. (Folks who have studied calculus may actually remember some of this from when they were learning about integrals and integral approximations!)


With the total area under the curve calculated, all we have to do is divide the time-weighted sum by the overall ΔT and we have our time-weighted average. 💥


Now that we've worked through our time-weighted average in theory, let’s test it out in SQL.


How to compute time-weighted averages in SQL

Let’s consider the scenario of an ice cream manufacturer or shop owner who is monitoring their freezers. It turns out that ice cream needs to stay in a relatively narrow range of temperatures (~0-10℉)(Read the footnote 1) so that it doesn’t melt and re-freeze, causing those weird crystals that no one likes. Similarly, if ice cream gets too cold, it’s too hard to scoop.


The air temperature in the freezer will vary a bit more dramatically as folks open and close the door, but the ice cream temperature takes longer to change. Thus, problems (melting, pesky ice crystals) will only happen if it's exposed to extreme temperatures for a prolonged period. By measuring this data, the ice cream manufacturer can impose quality controls on each batch of product they’re storing in the freezer.


Taking this into account, the sensors in the freezer measure temperature in the following way: when the door is closed and we’re in the optimal range, the sensors take a measurement every 5 minutes; when the door is opened, the sensors take a measurement every 30 seconds until the door is closed, and the temperature has returned below 10℉.


To model that we might have a simple table like this:


CREATE TABLE freezer_temps (
	freezer_id int,
	ts timestamptz,
	temperature float);


And some data like this:


INSERT INTO freezer_temps VALUES 
( 1, '2020-01-01 00:00:00+00', 4.0), 
( 1, '2020-01-01 00:05:00+00', 5.5), 
( 1, '2020-01-01 00:10:00+00', 3.0), 
( 1, '2020-01-01 00:15:00+00', 4.0), 
( 1, '2020-01-01 00:20:00+00', 3.5), 
( 1, '2020-01-01 00:25:00+00', 8.0), 
( 1, '2020-01-01 00:30:00+00', 9.0), 
( 1, '2020-01-01 00:31:00+00', 10.5), -- door opened!
( 1, '2020-01-01 00:31:30+00', 11.0), 
( 1, '2020-01-01 00:32:00+00', 15.0), 
( 1, '2020-01-01 00:32:30+00', 20.0), -- door closed
( 1, '2020-01-01 00:33:00+00', 18.5), 
( 1, '2020-01-01 00:33:30+00', 17.0), 
( 1, '2020-01-01 00:34:00+00', 15.5), 
( 1, '2020-01-01 00:34:30+00', 14.0), 
( 1, '2020-01-01 00:35:00+00', 12.5), 
( 1, '2020-01-01 00:35:30+00', 11.0), 
( 1, '2020-01-01 00:36:00+00', 10.0), -- temperature stabilized
( 1, '2020-01-01 00:40:00+00', 7.0),
( 1, '2020-01-01 00:45:00+00', 5.0);


The period after the door opens, minutes 31-36, has a lot more data points. If we were to take the average of all the points, we would get a misleading value. The freezer was only above the threshold temperature for 5 out of 45 minutes (11% of the time period), but those minutes make up 10 out of 20 data points (50%!) because we sample freezer temperature more frequently after the door is opened.


To find the more accurate, time-weighted average temperature, let’s write the SQL for the formula above that handles that case. We’ll also get the normal average just for comparison’s sake. (Don’t worry if you have trouble reading it, we’ll write a much simpler version later).


WITH setup AS (
	SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp, 
		extract('epoch' FROM ts) as ts_e, 
		extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e, 
		* 
	FROM  freezer_temps), 
nextstep AS (
	SELECT CASE WHEN prev_temp is NULL THEN NULL 
		ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum, 
		* 
	FROM setup)
SELECT freezer_id,
avg(temperature), -- the regular average
	sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average -- our derived average
FROM nextstep
GROUP BY freezer_id;


 freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111


It does return what we want, and gives us a much better picture of what happened, but it’s not exactly fun to write, is it?


We’ve got a few window functions in there, some case statements to deal with nulls, and several CTEs to try to make it reasonably clear what’s going on. This is the kind of thing that can really lead to code maintenance issues when people try to figure out what’s going on and tweak it.


Code is all about managing complexity, lots of complex queries to accomplish a relatively simple task makes it much less likely that the developer who comes along next (or you in 3 months) will understand what’s going on, how to use it, or how to change it if they (or you!) need a different result. Or, worse, it means that the code will never get changed because people don’t quite understand what the query’s doing, and it just becomes a black box that no one wants to touch (including you).


TimescaleDB hyperfunctions to the rescue!

This is why we created hyperfunctions, to make complicated time-series data analysis less complex. Let’s look at what the time-weighted average freezer temperature query looks like if we use the hyperfunctions for computing time-weighted averages:


SELECT freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY freezer_id;


 freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111


Isn’t that so much more concise?! Calculate a time_weight with a ´Linear´ weighting method (that’s the kind of weighting derived above (Read the footnote 2)), then take the average of the weighted values, and we’re done. I like that API much better (and I’d better, because I designed it!).


What’s more, not only do we save ourselves from writing all that SQL, but it also becomes far, far easier to compose (build up more complex analyses over top of the time-weighted average). This is a huge part of the design philosophy behind hyperfunctions; we want to make fundamental things simple so that you can easily use them to build more complex, application-specific analyses.


Let’s imagine we’re not satisfied with the average over our entire dataset, and we want to get the time-weighted average for every 10-minute bucket:


SELECT time_bucket('10 mins'::interval, ts) as bucket, 
	freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY bucket, freezer_id;


We added a time_bucket, grouped by it, and done! Let’s look at some other kinds of sophisticated analysis that hyperfunctions enable.


Continuing with our ice cream example, let’s say that we’ve set our threshold because we know that if the ice cream spends more than 15 minutes above 15 ℉, it’ll develop those ice crystals that make it all sandy/grainy tasting. We can use the time-weighted average in a window function to see if that happened:


SELECT *, 
average(time_weight('Linear', ts, temperature) OVER fifteen_min) as rolling_twa
FROM freezer_temps
WINDOW fifteen_min AS 
(PARTITION BY freezer_id ORDER BY ts RANGE  '15 minutes'::interval PRECEDING)
ORDER BY freezer_id, ts;


 freezer_id |           ts           | temperature |    rolling_twa     
------------+------------------------+-------------+--------------------
          1 | 2020-01-01 00:00:00+00 |           4 |                   
          1 | 2020-01-01 00:05:00+00 |         5.5 |               4.75
          1 | 2020-01-01 00:10:00+00 |           3 |                4.5
          1 | 2020-01-01 00:15:00+00 |           4 |  4.166666666666667
          1 | 2020-01-01 00:20:00+00 |         3.5 | 3.8333333333333335
          1 | 2020-01-01 00:25:00+00 |           8 |  4.333333333333333
          1 | 2020-01-01 00:30:00+00 |           9 |                  6
          1 | 2020-01-01 00:31:00+00 |        10.5 |  7.363636363636363
          1 | 2020-01-01 00:31:30+00 |          11 |  7.510869565217392
          1 | 2020-01-01 00:32:00+00 |          15 |  7.739583333333333
          1 | 2020-01-01 00:32:30+00 |          20 |               8.13
          1 | 2020-01-01 00:33:00+00 |        18.5 |  8.557692307692308
          1 | 2020-01-01 00:33:30+00 |          17 |  8.898148148148149
          1 | 2020-01-01 00:34:00+00 |        15.5 |  9.160714285714286
          1 | 2020-01-01 00:34:30+00 |          14 |   9.35344827586207
          1 | 2020-01-01 00:35:00+00 |        12.5 |  9.483333333333333
          1 | 2020-01-01 00:35:30+00 |          11 | 11.369047619047619
          1 | 2020-01-01 00:36:00+00 |          10 | 11.329545454545455
          1 | 2020-01-01 00:40:00+00 |           7 |             10.575
          1 | 2020-01-01 00:45:00+00 |           5 |  9.741666666666667


The window here is over the previous 15 minutes, ordered by time. And it looks like we stayed below our ice-crystallization temperature!


We also provide a special rollup function so you can re-aggregate time-weighted values from subqueries. For instance:


SELECT average(rollup(time_weight)) as time_weighted_average 
FROM (SELECT time_bucket('10 mins'::interval, ts) as bucket, 
		freezer_id, 
		time_weight('Linear', ts, temperature)
	FROM freezer_temps
	GROUP BY bucket, freezer_id) t;


time_weighted_average 
-----------------------
    6.636111111111111


This will give us the same output as a grand total of the first equation because we’re just re-aggregating the bucketed values.


But this is mainly there so that you can do more interesting analysis, like, say, normalizing each ten-minute time-weighted average by freezer to the overall time-weighted average.


WITH t as (SELECT time_bucket('10 mins'::interval, ts) as bucket, 
		freezer_id, 
		time_weight('Linear', ts, temperature)
	FROM freezer_temps
	GROUP BY bucket, freezer_id) 
SELECT bucket, 
	freezer_id, 
	average(time_weight) as bucketed_twa,  
	(SELECT average(rollup(time_weight)) FROM t) as overall_twa, 
	average(time_weight) / (SELECT average(rollup(time_weight)) FROM t) as normalized_twa
FROM t;


This kind of feature (storing the time-weight for analysis later) is most useful in a continuous aggregate, and it just so happens that we’ve designed our time-weighted average to be usable in that context!


We’ll be going into more detail on that in a future post, so be sure to subscribe to our newsletter so you can get notified when we publish new technical content.


We’d like to give a special thanks to @inselbuch, who submitted the GitHub issue that got us started on this project (as well as the other folks who 👍’d it and let us know they wanted to use it.)


We believe time-series data is everywhere, and making sense of it is crucial for all manner of technical problems. We built hyperfunctions to make it easier for developers to harness the power of time-series data. We’re always looking for feedback on what to build next and would love to know how you’re using hyperfunctions, problems you want to solve, or things you think should - or could - be simplified to make analyzing time-series data in SQL that much better. (To contribute feedback, comment on an open issue or in a discussion thread in GitHub.)



Footnotes

  1. I don’t know that these times or temperatures are accurate per se; however, the phenomenon of ice cream partially melting and refreezing causing larger ice crystals to form - and coarsening the ice cream as a result - is well documented. See, for instance, Harold McGee’s On Food And Cooking (p 44 in the 2004 revised edition). So, just in case you are looking for advice on storing your ice cream from a blog about time-series databases: for longer-term storage, you would likely want the ice cream to be stored below 0℉. Our example is more like a scenario you’d see in an ice cream display (e.g., in an ice cream parlor or factory line) since the ice cream is kept between 0-10℉ (ideal for scooping, because lower temperatures make ice cream too hard to scoop).


  2. We also offer ’LOCF’ or last observation carried forward weighting, which is best suited to cases where you record data points whenever the value changes (i.e., the old value is valid until you get a new one.) The derivation for that is similar, except the rectangles have the height of the first value, rather than the linear weighting we’ve discussed in this post (i.e., where we do linear interpolation between adjacent data points):


LOCF weighting is useful when you know the value is constant until the following point.

Rather than:

Linear weighting is useful when you are sampling a changing value at irregular intervals.

In general, linear weighting is appropriate for cases where the sampling rate is variable, but there are no guarantees provided by the system about only providing data when it changes. LOCF works best when there’s some guarantee that your system will provide data only when it changes, and you can accurately carry the old value until you receive a new one.


Previously published at https://blog.timescale.com/blog/what-time-weighted-averages-are-and-why-you-should-care/