How to Use Approximate Leave-one-out Cross-validation to Build Better Models

by Ryan Burn, July 20th, 2021

Leave-one-out Cross-validation (LOOCV) is one of the most accurate ways to estimate how well a model will perform on out-of-sample data. Unfortunately, it can be expensive, requiring a separate model to be fit for each point in the training data set. For the specialized cases of ridge regression, logistic regression, Poisson regression, and other generalized linear models, though, Approximate Leave-one-out Cross-validation (ALOOCV) gives us a much more efficient estimate of out-of-sample error that’s nearly as good as LOOCV.


In this post, I’ll cover:


  • What is ALOOCV
  • How to compute ALOOCV using Python packages
  • How to use ALOOCV for hyperparameter optimization
  • How to identify outliers in a training data set with ALOOCV


First, a refresher on LOOCV.

What is LOOCV?

Suppose we’re fitting a model to a data set of n feature vectors $x_1, x_2, \ldots, x_n$ and n associated target values $y_1, y_2, \ldots, y_n$.

Let X denote the matrix of feature vectors and y denote the vector of target values. With leave-one-out cross-validation, we fit n different models.


For each data entry i we form a new data set

$$(x_1, y_1), \ldots, (x_{i-1}, y_{i-1}), (x_{i+1}, y_{i+1}), \ldots, (x_n, y_n)$$

consisting of the original data set with the ith entry removed. Then we fit our model to this reduced data set and measure how well it predicts the ith target value. This gives us a leave-one-out error for the ith entry:

$$\mathrm{err}_i = \mathrm{err}\bigl(y_i, \hat{y}_{-i}(x_i)\bigr),$$

where $\hat{y}_{-i}$ denotes the model fit with the ith entry removed.

Averaging these errors across all data points then gives us an estimate of the out-of-sample error.
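

Written out, the LOOCV estimate is just the mean of the per-entry errors:

$$\mathrm{LOOCV} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{err}_i.$$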


Research has repeatedly shown LOOCV to be more accurate than other forms of k-fold cross-validation for estimating out-of-sample error [1]. But LOOCV is expensive: it requires us to fit many more models than 3- or 10-fold cross-validation does.

What is ALOOCV?

Enter Approximate Leave-one-out Cross-validation (ALOOCV).


I won’t give a detailed description of the math behind ALOOCV (check the references if you want that) [2], but here’s the gist:


For the specialized case of generalized linear models, we fit a single model to the full training data set, finding the weights that optimize the model’s cost function. We can then use properties of the cost function at its optimum to accurately and efficiently estimate what the model would predict for a data entry if that entry were removed from the training data set.
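

To make the idea concrete, here’s a minimal sketch for the simplest case, ridge regression (the function below is illustrative only and not part of bbai): after a single fit on the full data set, the exact leave-one-out residuals can be read off the diagonal of the hat matrix, with no refitting. ALOOCV extends this kind of shortcut to other generalized linear models.


import numpy as np

def ridge_loo_residuals(X, y, lam):
    # Fit once on the full data set: the hat matrix H maps y to fitted values.
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    y_hat = H @ y
    # Leave-one-out residuals without refitting:
    #   y_i - yhat_{-i} = (y_i - yhat_i) / (1 - H_ii)
    return (y - y_hat) / (1 - np.diag(H))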


Let’s see how we might use ALOOCV.

Using ALOOCV for Hyperparameter Optimization

We’ll look at the classic Iris data set. If you’re not familiar with the data set, the task is to predict which of three species of iris a plant is based on the characteristics of the flower. In our example, we’ll use a multinomial logistic regression model to predict the iris species.


Whenever we fit a logistic regression model, we have a regularization parameter C that we need to tune.


C acts as a dial that controls the complexity of the model: If we set C too low, our model won’t take full advantage of the training data; but if we set C too high, it will overfit the training data and perform poorly on out-of-sample data.
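

Concretely, assuming bbai follows the same convention as scikit-learn (where C is the inverse of the regularization strength), the fitted weights minimize something like

$$\frac{1}{2}\lVert w \rVert^2 \;+\; C \sum_{i=1}^{n} -\log p(y_i \mid x_i, w),$$

so a small C pulls the weights toward zero while a large C lets the training data dominate.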


Using ALOOCV, we can estimate how well logistic regression will perform for any given value of C.


Let’s plot out ALOOCV across a range of different C values. The Iris data set is small enough that it’s possible to compute LOOCV by brute force, so we’ll plot that out as well to see how accurate ALOOCV is.


To compute ALOOCV, we use the Python package bbai, which can be installed using pip:


pip install bbai


The Iris data set already comes packaged with sklearn. We can load and normalize the data set with this snippet of code:


import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X = np.hstack((X, np.ones((X.shape[0], 1)))) # add a bias column


Here’s how we can compute ALOOCV given a value of C:


import bbai.glm

def compute_aloocv(C):
    model = bbai.glm.LogisticRegression(C=C, fit_intercept=False)
    model.fit(X, y)
    return model.aloocv_
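

For a quick sanity check, we can evaluate ALOOCV at a single candidate value (the choice of 10.0 here is arbitrary):


print(compute_aloocv(10.0))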


Note that for logistic regression, cross-validation uses the negative log-likelihood as its error measurement.
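

In other words, the error recorded for the ith entry is the negative log of the probability that the leave-one-out model assigns to the true class:

$$\mathrm{err}_i = -\log \hat{p}_{-i}(y_i \mid x_i).$$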


This is how we can compute LOOCV by brute force:


from sklearn.model_selection import LeaveOneOut

def compute_loocv(C):
    ll_sum = 0
    for train_indexes, test_indexes in LeaveOneOut().split(X):
        X_train = X[train_indexes]
        y_train = y[train_indexes]
        X_test = X[test_indexes]
        y_test = y[test_indexes]
        model = bbai.glm.LogisticRegression(C=C, fit_intercept=False)
        model.fit(X_train, y_train)
        pred = model.predict_log_proba(X_test)
        ll_sum += pred[0][y_test[0]]
    return -ll_sum / len(y)


And here’s how we can plot ALOOCV and LOOCV:


import matplotlib.pyplot as plt

Cs = np.logspace(0.5, 2.5, num=100)
aloocvs = [compute_aloocv(C) for C in Cs]
loocvs = [compute_loocv(C) for C in Cs]
plt.plot(Cs, aloocvs, label='ALOOCV')
plt.plot(Cs, loocvs, label='LOOCV')
plt.xlabel('C')
plt.xscale('log')
plt.ylabel('Cross-Validation Error')
plt.legend()
plt.savefig('iris_cv.svg')


This results in:

ALOOCV vs LOOCV for different regularization strengths

To select a value of C, we could quickly test different values and pick the one with the best ALOOCV. But we can do much better.


ALOOCV isn’t just efficient to compute for a hyperparameter; it’s also possible to efficiently compute the first and second derivatives of ALOOCV with respect to hyperparameters [3]. Thus, we can apply a second-order optimizer to very quickly dial into the exact value of C that optimizes ALOOCV.
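

As a rough sketch of the idea (not necessarily the exact scheme bbai uses), let $f(t)$ denote the ALOOCV error as a function of $t = \log C$. A Newton iteration then repeatedly updates

$$t_{k+1} = t_k - \frac{f'(t_k)}{f''(t_k)},$$

typically homing in on the optimum in just a handful of steps when $f$ is smooth near its minimum.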


The bbai package can do all of this for us behind the scenes. Here’s how:


# Note: when we don't provide a value for C, bbai.glm.LogisticRegression
# will apply an optimizer to find the value of C with the best ALOOCV.
model = bbai.glm.LogisticRegression(fit_intercept=False)
model.fit(X, y)
print("C_opt = ", model.C_)


This prints out:


C_opt =  67.38021801069182


In addition to hyperparameter optimization, ALOOCV can also tell us a lot about the training data set.

Identifying Outliers With ALOOCV

When we compute ALOOCV, we get approximate leave-one-out errors for each data point as a byproduct. If an approximate leave-one-out error is large, it suggests that the associated data point is an outlier. Outliers can be worth drilling into: we might want to use a different model for them, or they might indicate an error in data collection.


Going back to the Iris data set, let’s order the data points by their approximate leave-one-out error and plot the individual errors. We’ll also plot the exact leave-one-out errors so we can see how close the approximations are.


def compute_loocvs(C):
    cvs = []
    for train_indexes, test_indexes in LeaveOneOut().split(X):
        X_train = X[train_indexes]
        y_train = y[train_indexes]
        X_test = X[test_indexes]
        y_test = y[test_indexes]
        model = bbai.glm.LogisticRegression(C=C, fit_intercept=False)
        model.fit(X_train, y_train)
        pred = model.predict_log_proba(X_test)
        cvs.append(-pred[0][y_test[0]])
    return cvs

n = len(y)
aloocvs = model.aloocvs_
loocvs = compute_loocvs(model.C_)
indexes = list(range(n))
indexes = sorted(indexes, key=lambda i: -aloocvs[i])
aloocvs = [aloocvs[i] for i in indexes]
loocvs = [loocvs[i] for i in indexes]
ix = list(range(n))
plt.plot(ix, aloocvs, marker='x', label='ALOO', linestyle='None')
plt.plot(ix, loocvs, marker='+', label='LOO', linestyle='None')
plt.ylabel('Leave-one-out Error')
plt.xlabel('Data Point')
plt.legend()
plt.savefig('iris_loo.svg')


This displays:

Leave-one-out errors for C = 67.38

Looking at this graph, we could choose a cutoff point and select points to examine further.
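

For instance, here’s one simple way we might pick out candidate points programmatically (the cutoff rule below, a multiple of the median error, is an illustrative choice, not something prescribed by bbai):


import numpy as np

def flag_outliers(aloocvs, multiple=5.0):
    # Flag points whose approximate leave-one-out error is much larger
    # than the typical (median) error.
    errors = np.asarray(aloocvs)
    cutoff = multiple * np.median(errors)
    return np.flatnonzero(errors > cutoff)

suspect_indexes = flag_outliers(model.aloocvs_)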


Complete code from this blog can be found at https://github.com/rnburn/bbai/blob/master/example/02-iris.py.

References

[1]: For a comparison of LOOCV to other forms of k-fold cross-validation, see A scalable estimate of the out-of-sample prediction error via approximate leave-one-out cross-validation

[2]: For more details on the math behind ALOOCV, see https://buildingblock.ai/logistic-regression-guide#approximate-leave-one-out-cross-validation

[3]: For details on computing the derivatives of ALOOCV and optimizing it, see Optimizing Approximate-leave-one-out Cross-validation