Dealing with Missing Data in Financial Time Series - Recipes and Pitfalls

Written by vkirilin | Published 2024/04/03

TL;DR: I focus on methods to handle missing data in financial time series. Using some example data, I show that LOCF is usually a decent go-to method compared to dropping and imputation, but it has its faults - i.e. it can create artificial, undesirable jumps in the data. However, alternatives like interpolation have their own problems, especially in the context of live prediction/forecasting.

If you’re anything like me - you’ve dealt with missing data in your datasets at least once. Or twice. Or one too many times…

Sometimes, all it takes to handle those pesky NAs is dropping them, i.e. removing the rows containing missing data. However, this might not always be optimal, especially for time series data, and even more so for financial data. Of course, this problem is well-studied, so many alternatives to dropping exist.

I will look into a few of those (listed below) and discuss the pros and cons:

  1. Dropping

  2. LOCF (last observation carried forward)

  3. Mean (or similar) imputation

  4. Interpolation

Spoiler alert: there is no one-size-fits-all approach! I will argue that LOCF is usually a good choice for finance, but it is not without its drawbacks. With that in mind, let me describe the methods and data that I’ll use to showcase them.

Note: if one wants to be pedantic, methods 2-4 are all examples of some form of imputation.

Model data

Let’s start with some examples of why one would care about dropping in the first place. To illustrate, I generated some over-simplified daily stock price data assuming it follows a random walk with no drift (i.e. average long-term price should stay constant) - not the most accurate but a benign assumption nevertheless.

import numpy as np
import pandas as pd

np.random.seed(10)  # needed for reproducibility
price_moves = 3*pd.Series(np.random.randn(100))  # generating random "steps" with 0 mean
price_vec = 100 + price_moves.cumsum()  # generating brownian motion
price_vec.plot()

Well, the plot looks quite well-behaved.

Suppose we now want to find out the empirical mean of daily price differences -

price_vec.diff().mean() #sample mean
>0.20030544816842052

Obviously non-zero, unlike the generating series - but this is just sample noise. So far so good.
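If you want to convince yourself that it really is just noise, a quick back-of-the-envelope check is to compare it to the standard error of the sample mean:

price_vec.diff().std() / np.sqrt(price_vec.diff().count())  # roughly 3/sqrt(99), i.e. about 0.3 - so a sample mean of 0.2 is well within noise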

Dropping data

Now let’s deform this data a little bit by deleting a few data points:

price_vec_na_simple = price_vec.copy()
price_vec_na_simple.iloc[90:95] = np.nan  # delete five consecutive observations
price_vec_na_simple.diff().mean()
>0.1433356258183252

We notice a couple of things right away -

  1. The mean is somehow non-NA even though the diff vector clearly will contain NAs

  2. The mean is different from the one we obtained before

Now, #1 is quite easy - pandas’ mean skips NAs by default (skipna=True).
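If you want to see this explicitly, switch the flag off:

price_vec_na_simple.diff().mean(skipna=False)
>nan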

But what about #2? Let’s rethink what we’re computing.

It’s easy to show that, at least without NAs, the mean price difference should simply be (price_vec[99] - price_vec[0])/99 - indeed, when we sum the price differences, all the “intermediate” pieces cancel out, like so: (price_vec[1] - price_vec[0]) + (price_vec[2] - price_vec[1]) + ...

Now, with the missing data inserted, if we first take differences and then drop NAs, this cancellation is broken - some easy math shows that you’re now computing (price_vec[99] - price_vec[0] - price_vec[95] + price_vec[89])/93.

To see this, notice that the six difference terms touching the gap - price_vec[90] - price_vec[89] through price_vec[95] - price_vec[94] - are now omitted, since (NA - any number) evaluates to NA and is then dropped. Their telescoping sum is exactly price_vec[95] - price_vec[89], which is what goes missing from the numerator, and we are left with 99 - 6 = 93 terms in the denominator.

Let’s verify this:

(price_vec[99] - price_vec[0])/99
>0.20030544816842052
(price_vec[99] - price_vec[0] - price_vec[95] + price_vec[89])/93
>0.1433356258183252

Now, it becomes clearer how we can fix things - we need to first drop NAs and then diff -

price_vec_na_simple.dropna().diff().mean()
>0.21095999328376203

The mean is almost back to where it should be - a small discrepancy happens because we now have fewer terms in the mean - 94 instead of 99.

Ok, so it looks like if we only care about the mean, we’re fine just using dropna (as long as we do it right)? After all, the difference between 0.2 and 0.21 is clearly within our noise tolerance in the first place. Well, not quite - let’s see why.

LOCF

What is LOCF?

LOCF stands for Last Observation Carried Forward. The idea behind it is super simple - if we record data at some time intervals (which may or may not be regular) and a particular interval’s observation is missing, we simply assume nothing has changed and carry the last non-missing value forward (for example - [3, 5, NA, 8] → [3, 5, 5, 8]). One may ask - why care about an interval with a missing observation in the first place, i.e. why not just delete it as in the “dropping” method? Well, the answer lies in an inherent defect of “dropping” that I didn’t mention above.

Suppose you record multiple quantities at once, especially those that don’t usually change too much too quickly - like hourly recordings of temperature and humidity. Suppose you have both values for 10:00, 11:00, and 12:00, but only humidity at 13:00. Do you just delete that “row” - i.e. pretend you don’t have a reading for 13:00? Well, that’s ok if you just have two variables - even though you just removed some potentially valuable info (the 13:00 humidity). But if you have many such occurrences or many variables at once, dropping might leave you with virtually no data at all!

A very attractive alternative is to just assume that nothing has changed for the temperature between 12:00 and 13:00. After all, if someone came to us at 12:30 and asked us - “what’s the current temperature”, we would have rightfully replied with the 12:00 reading (if we’re not able to get a new reading immediately, of course). Why not use the same logic for the 13:00 value?
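To make that concrete, here is a minimal sketch of the two options - the readings below are made up purely for illustration:

readings = pd.DataFrame(
    {"temperature": [21.0, 21.5, 22.0, np.nan],
     "humidity": [40, 42, 41, 43]},
    index=pd.to_datetime(["2024-01-01 10:00", "2024-01-01 11:00",
                          "2024-01-01 12:00", "2024-01-01 13:00"]))

readings.dropna()  # throws away the whole 13:00 row, the perfectly good humidity reading included
readings.ffill()   # keeps the row and carries the 12:00 temperature forward into 13:00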

Why use LOCF (in finance)?

First, let’s test out our new-found approach on the data from before:

price_vec_na_simple.ffill().diff().mean() # ffill performs LOCF by default
>0.20030544816842052

Looks like we recovered our old value precisely! In addition, if you want to do further research on the price difference data - it looks more “orderly” now as it has an entry for every day, even though five of those entries are now 0 (why? try to run price_vec_na_simple.ffill().diff().iloc[90:95] to see for yourself).

In addition to that, in finance, missing data and outlier data often come together. Let me illustrate that:

#inflate two observations, then delete the three following ones
price_moves_na = price_moves.copy()  # start from the same "clean" moves as before
price_moves_na[90] += 20
price_moves_na[91] += 30
price_moves_na[92] -= 50 # to "deflate" the price shock back

price_vec_na = (100 + price_moves_na.cumsum())
price_vec_na.iloc[92:95] = np.nan

price_vec_na.tail(20).plot()
price_vec_na.diff().dropna().mean()
>0.7093365245831178

We can see that after a sharp price increase, the data is simply not available for 3 days straight. This is not such an “artificial” example as it might sound! Imagine that trading was halted after the spike, at least on this particular exchange. Then things settled down a bit, and the price went back to the normal regime. Maybe something gradual was going on behind the scenes that actually “connected” the dots between the spike and the post-spike calm-down. But you don’t know that and don’t have any data for it!

What is the most natural assumption if we don’t have any new data? Well, recall that our data-generating model was fundamentally based on price changes. So if there’s no new data, perhaps the price is not changing at all? This is exactly what LOCF (Last Observation Carried Forward) assumes.

Some math for context

A side note for an inquisitive reader - perhaps a more fundamental view on why LOCF is particularly suitable for stock price data is that it’s usually modeled as a martingale. Roughly speaking, a martingale is something where our best guess for tomorrow is what we see today, or E[x_{t+1} | x_t] = x_t
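For our random-walk toy model, this is easy to sanity-check by simulation - the next move is independent of the current level, so knowing today’s price adds nothing beyond just carrying it forward (a quick illustrative check, not a proof):

walk = 100 + 3*pd.Series(np.random.randn(10000)).cumsum()  # a longer driftless random walk
(walk.shift(-1) - walk).corr(walk)  # ~0: the next move is uncorrelated with the current level
(walk.shift(-1) - walk).mean()      # ~0: on average, the best guess for tomorrow is today's value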

Ok, back to the actual data! Let’s examine the consequences of LOCF both visually and numerically:

price_vec_na.ffill().tail(20).plot()
price_vec_na.ffill().diff().mean()
>0.20030544816842052

Immediately, we see the pros and cons of LOCF (quite literally)! For one, the mean is back to where we “expect” it to be - i.e. the unaltered empirical value. However, we introduce a rather ugly period where the price is out of line with the “typical” level, plus an artificial drop in price between days 94 and 95.
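That drop is also easy to see in the numbers themselves:

price_vec_na.ffill().diff().loc[92:95]  # zeros on the filled days 92-94, then one large artificial negative move at day 95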

What about imputation?

Let’s contrast the results we get from LOCF with (mean) imputation. It’s a very common choice for NA handling especially for non-time series data. However, if done naively, it has many drawbacks when used for financial data.

  • If you just use the all-sample mean, you introduce an obvious look-ahead bias - i.e. you use future data to impute past values.

  • Using some sort of look-back or rolling mean is certainly better - however, it might sometimes be in tension with the martingale “baseline” view that we described before.

Let’s look into this in a bit more detail. I will use the look-back imputation on our old data -

price_vec_na_impute = price_vec_na.copy()
price_vec_na_impute[price_vec_na_impute.isna()] = price_vec_na.iloc[:90].mean()
price_vec_na_impute.diff().mean()
>0.20030544816842052

We recover the “correct” price change mean, same as with LOCF. BUT we introduce an artificial price drop between days 91 and 92, which in some ways is even worse than the one we had before. After all, that one happened at or after the point when things had likely calmed down, whereas this one just assumes everything goes back to normal right away. Aside from that, in practice it can be somewhat challenging to balance the look-back window so that we a) capture recent trends but also b) capture long-term tendencies (the usual bias-variance tradeoff).
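For completeness, here is what a rolling look-back imputation could look like - the 30-day window is an arbitrary choice here, purely to illustrate the knob one has to tune:

rolling_mean = price_vec_na.rolling(window=30, min_periods=1).mean()  # uses only current and past values
price_vec_na_roll = price_vec_na.fillna(rolling_mean)
price_vec_na_roll.diff().mean()  # the telescoping argument again gives ~0.2, whatever values we fill in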

Adding a second variable

Suppose we now want to perform a more complex task - extract the correlation between the price moves of two assets from the empirical data, when one or both of the price series have missing data. Sure, we can still use dropping, but:

  • even if we can use it, is it optimal?

  • what if we have many variables - then dropping all rows with at least one NA could leave us with no data at all!

There are many reasons one may want to compute correlation - it’s the first step of EDA in nearly every multi-variable model, it’s used pretty widely in any sort of portfolio construction, and so on. So measuring this number as accurately as we can is quite necessary!

To illustrate, let’s generate a second variable with a “built-in” correlation of 0.4 to the first one. To do it, we’ll use a simple two-factor Gaussian construction - the second series of moves is a mix of the first series’ moves and an independent Gaussian factor. The picture one can have in mind is of two correlated stocks that share an important risk factor, but the second stock also has exposure to a major risk factor that the first one does not. Think Google and Facebook, for example - the first factor could be generic sentiment about the tech sector, and the second could be competition with rival social networks.

np.random.seed(2) # needed to ensure a fixed second series
price_moves_2 = pd.Series(np.random.randn(100))
price_vec_2 = 50+(0.4*price_moves/3 + np.sqrt(1-0.4**2)*price_moves_2).cumsum() # all this math to ensure we get a 0.4 "theoretical" correlation with the first one

Let’s check the “baseline” empirical correlation - that is, without the NAs and jumps.

pd.concat([price_vec, price_vec_2], axis = 1).diff().corr().iloc[0,1]
>0.4866403018044526

Now this is reasonably close to the “theoretical” correlation - it’s well known that empirical measurement of correlation is prone to rather large noise.
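How large is “rather large”? A rough rule of thumb: via the Fisher transformation, the standard error of an estimated correlation is about 1/sqrt(n-3) (in transformed units), so with ~100 observations landing at 0.49 instead of 0.4 is roughly a one-standard-error miss - nothing unusual:

1/np.sqrt(price_vec.diff().count() - 3)  # ≈ 0.1, the Fisher-transform standard error for 99 observations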

NAs without outliers

As a next step, we’ll examine the case with NAs but no outliers. We’ll also compare what happens if we dropna before vs. after diff.

pd.concat([price_vec_na_simple, price_vec_2], axis = 1).diff().corr().iloc[0,1] # implicit dropna after diff by default in corr 
>0.5022675176281746
pd.concat([price_vec_na_simple, price_vec_2], axis = 1).dropna().diff().corr().iloc[0,1]
>0.5287405341268966

Both results are quite close and not too far from the “empirical” value we got before. Let’s verify LOCF and imputation also perform OK:

pd.concat([price_vec_na_simple, price_vec_2], axis = 1).ffill().diff().corr().iloc[0,1]
>0.5049380499525835
price_vec_na_simple_impute = price_vec_na_simple.copy()
price_vec_na_simple_impute[price_vec_na_simple_impute.isna()] = price_vec_na_simple_impute.iloc[:90].mean()
pd.concat([price_vec_na_simple_impute, price_vec_2], axis = 1).ffill().diff().corr().iloc[0,1]
>0.4866728183859715

From comparing the above 4 results, we see that all methods perform quite well. Perhaps we should expect the same for the outlier case then?

NAs with outliers

Remember, to stay consistent we need to expose the second price series to the same price shocks the first one experienced, but without the following NAs. Going back to the example above - imagine some major event causing a spike in the first risk factor that eventually halts trading in the first asset. The second asset will experience those too, sure, but perhaps to a lesser extent, and so no halting would take place and thus no NAs.

price_vec_na_2 = 50+(0.4*price_moves_na/3 + np.sqrt(1-0.4**2)*price_moves_2).cumsum()

Let’s again compare the performance of all of our methods -

pd.concat([price_vec_na, price_vec_na_2], axis = 1).diff().corr().iloc[0,1]
>0.6527112906179914
pd.concat([price_vec_na, price_vec_na_2], axis = 1).dropna().diff().corr().iloc[0,1]
>0.7122391279139506

That’s quite a difference from both the theoretical and the empirical values! How about LOCF and imputation?

pd.concat([price_vec_na, price_vec_na_2], axis = 1).ffill().diff().corr().iloc[0,1]
>0.33178239830519984
pd.concat([price_vec_na_impute, price_vec_na_2], axis = 1).dropna().diff().corr().iloc[0,1]
>0.7280990594963112

Now we finally see what LOCF is worth! It clearly outperforms all the other methods!

EDIT: NOT REALLY! To see why, let’s compute a more robust and perhaps more interpretable quantity - the regression beta between the price moves.

The Catch

We will regress the price moves of the second price vector against those of the first. We’ll start with the simpler “no outliers” case as a sanity check.

import statsmodels.formula.api as smf

smf.ols('y ~ x',  # this includes intercept by default
        pd.concat([price_vec, price_vec_2], axis = 1, keys = ['x', 'y']).diff()).fit().params[1]
>0.18197413617036273 # empirical beta
0.4/3
>0.13333333333333333 # "theoretical" regression beta

The beta is biased, but if one looks at the confidence interval for it via the summary() function, the two values are still in agreement.
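For reference, the interval itself can also be pulled out directly - conf_int() is a standard method on the fitted statsmodels results:

smf.ols('y ~ x',
        pd.concat([price_vec, price_vec_2], axis = 1, keys = ['x', 'y']).diff()).fit().conf_int()

How about adding the NAs?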

smf.ols('y ~ x', 
        pd.concat([price_vec_na_simple, price_vec_2], axis = 1, keys = ['x', 'y']).dropna().diff()).fit().params[1]
>0.18598423322170504
smf.ols('y ~ x', 
        pd.concat([price_vec_na_simple, price_vec_2], axis = 1, keys = ['x', 'y']).ffill().diff()).fit().params[1]
>0.1873092624645616

Pretty smooth so far. But what changes if we add outliers?

smf.ols('y ~ x', 
        pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).dropna().diff().dropna()).fit().params[1]
>0.13921849344912401
smf.ols('y ~ x', 
        pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).ffill().diff()).fit().params[1]
>0.06785360083909411

Well, that’s surprising! Why is LOCF underperforming so much here? To understand, it’s helpful to examine some values:

pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).ffill().diff().loc[95]
> x   -47.556730
 y     1.321239

pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).dropna().diff().loc[95]
> x   -47.556730
y    -5.503415

What's going on here? By doing LOCF, we’re effectively comparing price_1[95] - price_1[91] vs price_2[95] - price_2[94]. Hence, one of the price moves does contain a jump in it whereas the other does not. When you do dropping instead, this does not happen - you compare price_1[95] - price_1[91] vs price_2[95] - price_2[91] - now the second price move also contains the jump! The mean imputation works just as well here:

smf.ols('y ~ x', 
        pd.concat([price_vec_na_impute, price_vec_na_2], axis = 1, keys = ['x', 'y']).ffill().diff()).fit().params[1]
>0.14041413077955972

So how come things seemed to work with correlation? Well, secretly, they didn’t! The point is, the original claim of a 0.4 “theoretical” correlation also relied on the standard deviation of price_moves/3 being 1 - this was the reason for the 1/3 factor in the definition. That held for the original price_moves, but it was broken by introducing the price jumps! This messes up the “theoretical” correlation computation - recall that corr(x,y) = cov(x,y)/(std(x) * std(y)) - so changing the standard deviation changes all our conclusions from the previous chapter. A curious reader can go back to that computation to convince themselves that the correlation results for dropping were actually “correct” in that sense - as also evidenced by the beta!
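A quick way to see the effect is to compare the realized standard deviation of the price moves before and after the jumps were introduced (the exact numbers will depend on how the NAs are handled):

price_vec.diff().std()             # close to 3, the std of the clean generating steps
price_vec_na.ffill().diff().std()  # noticeably larger once the +20/+30/-50 shocks enter the sample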

Potential Pitfalls + Interpolation

Of course, this shows that LOCF is not 100% robust. For one, by doing LOCF we introduce a large price drop when the missing data ends. If it coincides with some outliers on the second price vector, the results could change by quite a bit. (An exercise for the reader - flip the sign on the price move of price_vec_na_2[95] and check how it affects the results.) It’s not quite clear whether it’s “clean” to just introduce this price drop as opposed to, e.g., interpolating between the price peak price_vec_na[91] and the “normal” value afterward, price_vec_na[95]. However, especially for “live” usage, interpolation is not really possible! After all, if today is day #93, how can we interpolate using a future value recorded at the end of day #95? For a historical study - sure, that remains an option, but then it’s unclear how to interpret it and use it for actual forecasting. In conclusion, interpolation across the time dimension is possible, but somewhat more questionable.
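For a purely historical study, the interpolation itself is a one-liner in pandas - just keep in mind that filling days 92-94 this way uses the day-95 value, i.e. it peeks into the future:

price_vec_na.interpolate(method='linear').tail(20).plot()  # connects the peak at day 91 to the value at day 95
price_vec_na.interpolate(method='linear').diff().mean()    # the telescoping argument again gives ~0.2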

Conclusions

I tried to give a small case study to introduce and explain why LOCF is often the most attractive and simplest option for handling missing data in financial time series - but it has to be used mindfully, especially in the presence of outliers!

To recap, the pros are:

  • Attractive from a martingale/”information flow” perspective
  • Super easy to implement
  • No need to optimize (as opposed to imputation, e.g.)
  • Handles outliers at least decently - EDIT: see the chapter added for a counterexample!

Some cons:

  • Can potentially cause large jumps at the end of the missing period
  • Might miss some nuanced joint dynamics when used for several variables

As a quant in a prop trading shop, I use it for almost all of my studies as an effective baseline. Some situations call for more nuanced measures of course, but those are few and far between and usually are not really 100% “solved” by any of the 3 other methods mentioned either.

