If you’re anything like me, you’ve dealt with missing data in your datasets at least once. Or twice. Or one too many times…
Sometimes, all it takes to handle those pesky NAs is dropping them, i.e. removing the rows containing missing data. However, this is not always optimal, especially for time series data, and even more so for financial data. Of course, this problem is well-studied, so many alternatives to dropping exist.
I will look into a few of those (listed below) and discuss the pros and cons:
Dropping
LOCF (last observation carried forward)
Mean (or similar) imputation
Interpolation
Spoiler alert: there is no one-size-fits-all approach! I will argue that LOCF is usually a good choice for finance, but it is not without its drawbacks either. With that in mind, let me describe the methods and data that I’ll use to showcase them.
Note: if one wants to be pedantic, methods 2–4 are all examples of some form of imputation.
Let’s start with some examples of why one would care about dropping in the first place. To illustrate, I generated some oversimplified daily stock price data assuming it follows a random walk with no drift (i.e. the average long-term price should stay constant). Not the most accurate assumption, but a benign one nevertheless.
import numpy as np
import pandas as pd

np.random.seed(10) # needed for reproducibility
price_moves = 3*pd.Series(np.random.randn(100)) # generating random "steps" with 0 mean
price_vec = 100 + price_moves.cumsum() # generating the random-walk price path
price_vec.plot()
Well, the plot looks quite well-behaved.
Suppose we now want to find out the empirical mean of the daily price differences:
price_vec.diff().mean() #sample mean
>0.20030544816842052
Obviously nonzero, unlike the generating series, but this is just sample noise. So far so good.
Now let’s deform this data a little bit by deleting a few data points:
price_vec_na_simple = price_vec.copy()
price_vec_na_simple.iloc[90:95] = np.nan # delete five observations
price_vec_na_simple.diff().mean()
>0.1433356258183252
We notice a couple of things right away:
1. The mean is somehow non-NA, even though the diff vector clearly contains NAs
2. The mean is different from the one we obtained before
Now, #1 is quite easy: pandas’ mean removes NAs automatically by default (controlled by the skipna argument).
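As a quick standalone illustration of that default behavior (a made-up toy series, not the price data above):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])
print(s.mean())              # NaNs skipped by default -> (1 + 2 + 4) / 3
print(s.mean(skipna=False))  # -> nan
```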
But what about #2? Let’s rethink what we’re computing.
It’s easy to show that, at least without NAs, the mean price difference should simply be (price_vec[99] - price_vec[0])/99. Indeed, when we sum price differences, all the “intermediate” pieces cancel out, like so: (price_vec[1] - price_vec[0]) + (price_vec[2] - price_vec[1]) + …
Now, with the missing data inserted, if we first take differences and then drop NAs, this cancellation is broken. Some easy math shows that we’re now computing (price_vec[99] - price_vec[0] - price_vec[95] + price_vec[89])/93. To see this, notice that the boundary terms price_vec[95] - price_vec[94] and price_vec[90] - price_vec[89] are now omitted, since (NA - any number) evaluates to NA and is then dropped.
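The telescoping argument is easiest to see on a tiny toy series (made-up numbers, not the series above):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])
print(s.diff().dropna().sum())   # 1.0 -- the two differences touching the NA are lost
print(s.dropna().diff().sum())   # 3.0 == 4 - 1, the telescoping sum survives
```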
Let’s verify this:
(price_vec[99] - price_vec[0])/99
>0.20030544816842052
(price_vec[99] - price_vec[0] - price_vec[95] + price_vec[89])/93
>0.1433356258183252
Now it becomes clearer how we can fix things: we need to first drop the NAs and then take the diff:
price_vec_na_simple.dropna().diff().mean()
>0.21095999328376203
The mean is almost back to where it should be; a small discrepancy happens because we now have fewer terms in the mean: 94 instead of 99.
Ok, so it looks like if we only care about the mean, we’re fine just using dropna (as long as we do it right)? After all, the difference between 0.2 and 0.21 is clearly within our noise tolerance in the first place. Well, not quite. Let’s see why.
LOCF stands for Last Observation Carried Forward. The idea behind it is super simple: if I record data at some time intervals (which may or may not be regular) and some particular interval’s observation is missing, we simply assume nothing has changed and replace it with the last non-missing value (for example, [3, 5, NA, 8] → [3, 5, 5, 8]). One may ask: why care about an interval with a missing observation in the first place, i.e. why not just delete it as in the “dropping” method? Well, the answer lies in an inherent defect of “dropping” that I didn’t mention above.
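In pandas, the bracketed example above is exactly what ffill does:

```python
import numpy as np
import pandas as pd

obs = pd.Series([3.0, 5.0, np.nan, 8.0])
print(obs.ffill().tolist())  # [3.0, 5.0, 5.0, 8.0]
```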
Suppose you record multiple quantities at once, especially ones that don’t usually change too much too quickly, like hourly recordings of temperature and humidity. Suppose you have both values for 10:00, 11:00, and 12:00, but only humidity at 13:00. Do you just delete that “row”, i.e. pretend you don’t have a reading for 13:00? That’s OK if you have just two variables, even though you’ve removed some potentially valuable info (the 13:00 humidity). But if you have many such occurrences or many variables at once, dropping might leave you with virtually no data at all!
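How fast does dropping decay with the number of columns? A quick simulated sanity check (synthetic data; the 1000 rows, 20 columns, and 5% independent NA rate are all arbitrary assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 20)))
df = df.mask(rng.random(df.shape) < 0.05)  # knock out ~5% of each column independently

# a row survives dropna with probability ~0.95**20, i.e. roughly 36%
print(len(df.dropna()))
```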
A very attractive alternative is to just assume that nothing has changed for the temperature between 12:00 and 13:00. After all, if someone came to us at 12:30 and asked “what’s the current temperature?”, we would have rightfully replied with the 12:00 reading (if we’re not able to get a new reading immediately, of course). Why not use the same logic for the 13:00 value?
First, let’s test out our newfound approach on the data from before:
price_vec_na_simple.ffill().diff().mean() # ffill performs LOCF by default
>0.20030544816842052
Looks like we recovered our old value precisely! In addition, if you want to do further research on the price difference data, it looks more “orderly” now, as it has an entry for every day, even though five of those entries are now 0 (why? try running price_vec_na_simple.ffill().diff().iloc[90:95] to see for yourself).
In addition to that, in finance, missing data and outlier data often come together. Let me illustrate:
# inflate two observations, then delete the three following ones
price_moves_na = price_moves.copy()
price_moves_na[90] += 20
price_moves_na[91] += 30
price_moves_na[92] -= 50 # to "deflate" the price shock back
price_vec_na = 100 + price_moves_na.cumsum()
price_vec_na.iloc[92:95] = np.nan
price_vec_na.tail(20).plot()
price_vec_na.diff().dropna().mean()
>0.7093365245831178
We can see that after a sharp price increase, the data is simply not available for 3 days straight. This is not as “artificial” an example as it might sound! Imagine that trading halted after the spike, at least on this particular exchange. Then things settled down a bit, so the price went back to the normal regime. Maybe something gradual was going on behind the scenes that actually “connected” the dots between the spike and the post-spike calm-down. But you don’t know that and don’t have any data for it!
What is the most natural assumption if we don’t have any new data? Well, recall that our datagenerating model was fundamentally based on price changes. So if there’s no new data, perhaps the price is not changing at all? This is exactly what LOCF (Last Observation Carried Forward) assumes.
A side note for an inquisitive reader: perhaps a more fundamental view on why LOCF is particularly suitable for stock price data is that prices are usually modeled as a martingale. Roughly speaking, a martingale is something where our best guess for tomorrow is what we see today, i.e. E[x_{t+1} | x_t] = x_t.
Ok, back to the actual data! Let’s examine the consequences of LOCF both visually and numerically:
price_vec_na.ffill().tail(20).plot()
price_vec_na.ffill().diff().mean()
>0.20030544816842052
Immediately, we see the pros and cons of LOCF (quite literally)! For one, the mean is back where we “expect” it to be, i.e. the unaltered empirical value. However, we introduce a rather ugly period where the price is out of line with the “typical” level, and an artificial drop in price between days 94 and 95.
Let’s contrast the results we get from LOCF with (mean) imputation. It’s a very common choice for NA handling, especially for non-time-series data. However, if done naively, it has many drawbacks when used for financial data.
If you just use the all-sample mean, you introduce an obvious look-ahead bias, i.e. you use future data to impute past values.
Using some sort of look-back or rolling mean is certainly better; however, it can sometimes be in tension with the martingale “baseline” view we described before.
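A minimal sketch of what such a look-back imputation could look like (toy numbers; the 3-observation window is an arbitrary choice, and this only handles short gaps gracefully):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 11.0, np.nan, 12.0, np.nan, 13.0])
# trailing mean over the last 3 rows; NaNs inside a window are ignored
trailing = s.rolling(window=3, min_periods=1).mean()
s_imputed = s.fillna(trailing)
print(s_imputed.tolist())  # [10.0, 11.0, 10.5, 12.0, 12.0, 13.0]
```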
Let’s look into this in a bit more detail. I will use the look-back imputation on our old data:
price_vec_na_impute = price_vec_na.copy()
price_vec_na_impute[price_vec_na_impute.isna()] = price_vec_na.iloc[:90].mean()
price_vec_na_impute.diff().mean()
>0.20030544816842052
We recover the “correct” price change mean, same as with LOCF. BUT we introduce an artificial price drop between days 91 and 92, which in some ways is even worse than the one we had before. After all, that one happened when or after things had likely calmed down, whereas this one just assumes everything goes back to normal right away. Aside from that, in practice it can be somewhat challenging to tune the look-back window so that we a) capture recent trends but also b) capture long-term tendencies (the usual bias-variance tradeoff).
Suppose we now want to perform a more complex task: extract the correlation between the price moves of two assets from the empirical data, when one or both of the price series have missing data. Sure, we can still use dropping, but:
even if we can use it, is it optimal?
what if we have many variables? Then dropping all rows with at least one NA could leave us with no data at all!
There are many reasons one may want to compute correlation: it’s the first step of EDA in nearly every multivariable model, it’s used pretty widely in all sorts of portfolio construction, and so on. So measuring this number as accurately as we can is quite necessary!
To illustrate, let’s generate a second variable with a “built-in” correlation of 0.4 to the first one. To do it, we’ll use a sort of Gaussian Mixture Model. The picture one can have in mind is of two correlated stocks that share an important risk factor, but the second stock also has exposure to a major risk factor that the first one does not. Think Google and Facebook, for example: the first factor could be generic sentiment about the tech sector, and the second could be competition with rival social networks.
np.random.seed(2) # needed to ensure a fixed second series
price_moves_2 = pd.Series(np.random.randn(100))
price_vec_2 = 50 + (0.4*price_moves/3 + np.sqrt(1 - 0.4**2)*price_moves_2).cumsum() # all this math ensures a 0.4 "theoretical" correlation with the first one
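As a sanity check on this construction, here is an independent large-sample simulation (a fresh seed and sample size, not the 100-point series above) showing that moves built this way do carry a correlation of about 0.4:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
x_moves = 3 * z1                                # mimics price_moves
y_moves = 0.4 * z1 + np.sqrt(1 - 0.4**2) * z2   # mimics the second asset's moves
print(np.corrcoef(x_moves, y_moves)[0, 1])      # ~0.4
```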
Let’s check the “baseline” empirical correlation, that is, without the NAs and jumps.
pd.concat([price_vec, price_vec_2], axis = 1).diff().corr().iloc[0,1]
>0.4866403018044526
Now this is reasonably close to the “theoretical” correlation; it’s well known that empirical measurement of correlation is prone to rather large noise.
As a next step, we’ll examine the case with NAs but no outliers. We’ll also compare what happens if we dropna before vs. after diff:
pd.concat([price_vec_na_simple, price_vec_2], axis = 1).diff().corr().iloc[0,1] # implicit dropna after diff by default in corr
>0.5022675176281746
pd.concat([price_vec_na_simple, price_vec_2], axis = 1).dropna().diff().corr().iloc[0,1]
>0.5287405341268966
Both results are quite close and not too far from the “empirical” value we got before. Let’s verify LOCF and imputation also perform OK:
pd.concat([price_vec_na_simple, price_vec_2], axis = 1).ffill().diff().corr().iloc[0,1]
>0.5049380499525835
price_vec_na_simple_impute = price_vec_na_simple.copy()
price_vec_na_simple_impute[price_vec_na_simple_impute.isna()] = price_vec_na_simple_impute.iloc[:90].mean()
pd.concat([price_vec_na_simple_impute, price_vec_2], axis = 1).ffill().diff().corr().iloc[0,1]
>0.4866728183859715
From comparing the above 4 results, we see that all methods perform quite well. Perhaps we should expect the same for the outlier case then?
Remember, to stay consistent we need to expose the second price series to the same price shocks the first one experienced, but without the subsequent NAs. Going back to the example above: imagine some major event causing a spike in the first risk factor that eventually halts trading in the first asset. The second asset experiences those shocks too, sure, but perhaps to a lesser extent, so no halt (and thus no NAs) takes place.
price_vec_na_2 = 50 + (0.4*price_moves_na/3 + np.sqrt(1 - 0.4**2)*price_moves_2).cumsum()
Let’s again compare the performance of all of our methods:
pd.concat([price_vec_na, price_vec_na_2], axis = 1).diff().corr().iloc[0,1]
>0.6527112906179914
pd.concat([price_vec_na, price_vec_na_2], axis = 1).dropna().diff().corr().iloc[0,1]
>0.7122391279139506
That’s quite a difference from both the theoretical and the empirical value! How about LOCF and imputation?
pd.concat([price_vec_na, price_vec_na_2], axis = 1).ffill().diff().corr().iloc[0,1]
>0.33178239830519984
pd.concat([price_vec_na_impute, price_vec_na_2], axis = 1).dropna().diff().corr().iloc[0,1]
>0.7280990594963112
Now we finally see what LOCF is worth! It clearly outperforms all the other methods!
EDIT: NOT REALLY! To see why, let’s compute a more robust and perhaps more interpretable quantity: the regression beta between the price moves.
We will regress the price moves of the second price vector against those of the first. We’ll start with the simpler “no outliers” case as a sanity check.
import statsmodels.formula.api as smf
smf.ols('y ~ x', # this includes intercept by default
pd.concat([price_vec, price_vec_2], axis = 1, keys = ['x', 'y']).diff()).fit().params[1]
>0.18197413617036273 # empirical beta
0.4/3
>0.13333333333333333 # "theoretical" regression beta
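Where does 0.4/3 come from? Under the generating model, the moves are x = 3*z1 and y = 0.4*z1 + sqrt(1 - 0.4**2)*z2, with z1 and z2 independent standard normals, so the arithmetic is:

```python
# beta = cov(y, x) / var(x); only the shared z1 term contributes to the covariance
cov_yx = 0.4 * 3   # cov(0.4*z1, 3*z1) = 0.4 * 3 * var(z1) = 1.2
var_x = 3.0 ** 2   # var(3*z1) = 9
beta = cov_yx / var_x
print(beta)        # 0.1333... == 0.4/3
```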
The beta is biased, but if one checks its confidence interval using the summary() function, the two values are still in agreement. How about adding the NAs?
smf.ols('y ~ x',
pd.concat([price_vec_na_simple, price_vec_2], axis = 1, keys = ['x', 'y']).dropna().diff()).fit().params[1]
>0.18598423322170504
smf.ols('y ~ x',
pd.concat([price_vec_na_simple, price_vec_2], axis = 1, keys = ['x', 'y']).ffill().diff()).fit().params[1]
>0.1873092624645616
Pretty smooth so far. But what changes if we add outliers?
smf.ols('y ~ x',
pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).dropna().diff().dropna()).fit().params[1]
>0.13921849344912401
smf.ols('y ~ x',
pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).ffill().diff()).fit().params[1]
>0.06785360083909411
Well, that’s surprising! Why is LOCF underperforming so much here? To understand, it’s helpful to examine some of the values:
pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).ffill().diff().loc[95]
> x   -47.556730
> y     1.321239
pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).dropna().diff().loc[95]
> x   -47.556730
> y    -5.503415
What's going on here? By doing LOCF, we’re effectively comparing price_1[95] - price_1[91] vs price_2[95] - price_2[94]. Hence, one of the price moves contains the jump while the other does not. When you do dropping instead, this does not happen: you compare price_1[95] - price_1[91] vs price_2[95] - price_2[91], so the second price move also contains the jump! The mean imputation works just as well here:
smf.ols('y ~ x',
pd.concat([price_vec_na_impute, price_vec_na_2], axis = 1, keys = ['x', 'y']).ffill().diff()).fit().params[1]
>0.14041413077955972
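The window-mismatch effect above can be reproduced on a small toy pair of series (made-up numbers: x spikes and then has a gap, y spikes without a gap):

```python
import numpy as np
import pandas as pd

x = pd.Series([100.0, 100.0, 150.0, np.nan, np.nan, 100.0, 100.0])  # spike, then NAs
y = pd.Series([50.0, 50.0, 56.0, 55.0, 50.0, 50.0, 50.0])           # spike, no NAs
df = pd.concat([x, y], axis=1, keys=['x', 'y'])

print(df.ffill().diff().loc[5].tolist())   # [-50.0, 0.0]  -- the jump shows up only in x
print(df.dropna().diff().loc[5].tolist())  # [-50.0, -6.0] -- the jump shows up in both
```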
So how come things seemed to work with correlation? Well, secretly, it didn’t! The point is, the original claim of a 0.4 “theoretical” correlation also relied on the standard deviation of price_moves/3 being 1 (this was the reason for the 1/3 factor in the definition). The vector price_moves/3 did have a standard deviation of 1, but that was broken by introducing the price jumps! This messes up the “theoretical” correlation computation; recall that corr(x,y) = cov(x,y)/(std(x) * std(y)). So changing the standard deviation changes all our conclusions in the previous chapter. A curious reader can go back to that computation to convince themselves that the correlation result for dropping was actually “correct” in that sense, as also evidenced by the beta!
Of course, this shows that LOCF is not 100% robust. For one, by doing LOCF we introduce a large price drop when the missing data ends. If it coincides with some outliers in the second price vector, the results could change by quite a bit. (An exercise for the reader: flip the sign of the price move of price_vec_na_2[95] and check how it affects the results.) It’s not quite clear whether it’s “clean” to just introduce this price drop, as opposed to e.g. interpolating between the price peak price_vec_na[91] and the “normal” value afterward, price_vec_na[95]. However, especially for “live” usage, interpolation is not really possible! After all, if today is day #93, how can we interpolate using a future value recorded at the end of day #95? For a historical study, sure, that remains an option, but then it remains unclear how to interpret it and use it for actual forecasting! In conclusion, interpolation across the time dimension is possible, but somewhat more questionable.
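For completeness, here is what linear interpolation would do on a small made-up gap; note that the filled values depend on the future observation, which is exactly the look-ahead concern described above:

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, 105.0, np.nan, np.nan, 111.0])
print(s.interpolate().tolist())  # [100.0, 105.0, 107.0, 109.0, 111.0] -- uses the future value 111
```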
I tried to give a small case study to introduce and motivate why LOCF is often the most attractive, simple option for handling missing data in financial time series, but it has to be used mindfully, especially in the presence of outliers!
To recap, the pros are:
Simple to apply, and consistent with the martingale view of prices: the best guess for a missing value is the last observed one
No look-ahead bias: only past data is used, so it’s suitable for “live” usage
Recovers aggregate quantities like the mean price change correctly
Some cons:
It can introduce artificial jumps when the missing stretch ends, especially right after outliers
With multiple series, it can misalign price moves across series, as we saw with the regression beta
As a quant at a prop trading shop, I use it for almost all of my studies as an effective baseline. Some situations call for more nuanced measures, of course, but those are few and far between, and usually are not really 100% “solved” by any of the three other methods mentioned either.