How to Use Approximate Leave-one-out Cross-validation to Build Better Models

Leave-one-out Cross-validation (LOOCV) is one of the most accurate ways to estimate how well a model will perform on out-of-sample data. Unfortunately, it can be expensive, requiring a separate model to be fit for each point in the training data set. For the specialized cases of ridge regression, logistic regression, Poisson regression, and other generalized linear models, though, Approximate Leave-one-out Cross-validation (ALOOCV) gives us a much more efficient estimate of out-of-sample error that’s nearly as good as LOOCV.

In this post, I’ll cover:

What is ALOOCV
How to compute ALOOCV using Python packages
How to use ALOOCV for hyperparameter optimization
How to identify outliers in a training data set with ALOOCV

First, a refresher on LOOCV.

What is LOOCV?

Suppose we’re fitting a model to a data set of n feature vectors x_1, …, x_n and n associated target values y_1, …, y_n.

Let X denote the matrix of feature vectors and y denote the vector of target values. With leave-one-out cross-validation, we fit n different models.

For each data entry i, we form a new data set consisting of the original data set with the ith entry removed. We then fit our model to that reduced data set and measure how well it predicts the ith target value. This gives us a leave-one-out error for the ith entry.

Averaging these errors across all data points then gives us an estimate of the out-of-sample error.
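To make the procedure concrete, here is a minimal sketch of brute-force LOOCV. It assumes an ordinary least-squares model with squared error as the error measure; the toy data and the names loocv_squared_error, X_toy, and y_toy are made up for illustration and are not part of any package.

```python
import numpy as np

def loocv_squared_error(X, y):
    """Brute-force LOOCV for ordinary least squares: one fit per data point."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # mask that drops the ith entry
        w, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errors[i] = (y[i] - X[i] @ w) ** 2  # error on the held-out entry
    return errors.mean(), errors

# Toy data (made up for illustration): a noisy line with a bias column.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
X_toy = np.column_stack([x, np.ones_like(x)])
y_toy = 2.0 * x + 0.5 + 0.1 * rng.normal(size=20)

mean_error, per_point_errors = loocv_squared_error(X_toy, y_toy)
print(mean_error)
```

The loop body runs a full fit n times, which is where the cost of LOOCV comes from.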

Research has repeatedly shown LOOCV to be more accurate than other forms of k-fold cross-validation for estimating out-of-sample error [1]. But LOOCV is expensive: it requires fitting n models, one per data point, compared with only 3 or 10 fits for 3-fold or 10-fold cross-validation.

What is ALOOCV?

Enter Approximate Leave-one-out Cross-validation (ALOOCV).

I won’t give a detailed description of the math behind ALOOCV (see [2] if you want that), but here’s a brief description:

For the specialized case of generalized linear models, we fit a single model to the full training data set. When we do that, we find weights that optimize the model’s cost function. We can then use properties of the cost function at its optimum to accurately and efficiently estimate what target value the model would predict for a data entry if that entry were removed from the training data set.
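Ridge regression gives a small illustration of the idea. There, a standard identity recovers each leave-one-out residual from the single full fit by dividing the ordinary residual by one minus the point's leverage, and the result is exact rather than approximate. The sketch below checks that identity against a brute-force refit. It is not how bbai computes ALOOCV for logistic regression; the function names and toy data are assumptions made up for this example.

```python
import numpy as np

def ridge_loo_residuals(X, y, lam):
    """Leave-one-out residuals for ridge regression from a single full fit."""
    p = X.shape[1]
    A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
    w = A_inv @ X.T @ y  # weights fit on all the data
    h = np.einsum('ij,jk,ik->i', X, A_inv, X)  # leverages x_i^T A^{-1} x_i
    return (y - X @ w) / (1.0 - h)

def ridge_loo_residuals_brute_force(X, y, lam):
    """The same residuals computed by refitting n times."""
    n, p = X.shape
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        w_i = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
        out[i] = y[i] - X[i] @ w_i
    return out

# Toy data, made up for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

fast = ridge_loo_residuals(X_toy, y_toy, lam=1.0)
slow = ridge_loo_residuals_brute_force(X_toy, y_toy, lam=1.0)
print(np.max(np.abs(fast - slow)))  # difference is at floating-point level
```

For models such as logistic regression there is no exact shortcut of this kind, which is why the leave-one-out predictions are estimated instead.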

Let’s see how we might use ALOOCV.

Using ALOOCV for Hyperparameter Optimization

We’ll look at the classic Iris data set. If you’re not familiar with the data set, the task is to predict which of three species of iris a plant is based on the characteristics of the flower. In our example, we’ll use a multinomial logistic regression model to predict the iris species.

Whenever we fit a logistic regression model, we have a regularization parameter C that we need to tune.

C acts as a dial that controls the complexity of the model: If we set C too low, our model won’t take full advantage of the training data; but if we set C too high, it will overfit the training data and perform poorly on out-of-sample data.

Using ALOOCV, we can estimate how well logistic regression will perform for any given value of C.

Let’s plot ALOOCV across a range of C values. The Iris data set is small enough that we can also compute LOOCV by brute force, so we’ll plot that too and see how accurate ALOOCV is.

To compute ALOOCV, we use the Python package bbai, which can be installed using pip:

pip install bbai

The Iris data set already comes packaged with sklearn. We can load and normalize the data set with this snippet of code:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X = np.hstack((X, np.ones((X.shape[0], 1)))) # add a bias column

Here’s how we can compute ALOOCV for a given value of C:

import bbai.glm

def compute_aloocv(C):
    model = bbai.glm.LogisticRegression(C=C, fit_intercept=False)
    model.fit(X, y)
    return model.aloocv_

Note that for logistic regression, cross-validation uses the negative log-likelihood as its error measurement.
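As a small illustration of that error measurement, the sketch below averages the negative log of the probability assigned to each point's true class. The probabilities and labels are made up for this example, and mean_negative_log_likelihood is a hypothetical helper, not a bbai function.

```python
import numpy as np

def mean_negative_log_likelihood(probs, labels):
    """Average of -log(probability assigned to the true class)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]  # p(true class) per row
    return -np.mean(np.log(picked))

# Made-up predicted probabilities for three held-out points and three classes.
probs = [[0.9, 0.05, 0.05],
         [0.2, 0.7, 0.1],
         [0.3, 0.3, 0.4]]
labels = [0, 1, 0]  # the third point is predicted poorly

nll = mean_negative_log_likelihood(probs, labels)
print(nll)
```

A confident correct prediction contributes close to zero, while a point whose true class receives low probability contributes a large error.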

This is how we can compute LOOCV by brute force:

from sklearn.model_selection import LeaveOneOut

def compute_loocv(C):
    ll_sum = 0
    for train_indexes, test_indexes in LeaveOneOut().split(X):
        X_train = X[train_indexes]
        y_train = y[train_indexes]
        X_test = X[test_indexes]
        y_test = y[test_indexes]
        model = bbai.glm.LogisticRegression(C=C, fit_intercept=False)
        model.fit(X_train, y_train)
        pred = model.predict_log_proba(X_test)
        ll_sum += pred[0][y_test[0]]
    return -ll_sum / len(y)

And here’s how we can plot ALOOCV and LOOCV:

import matplotlib.pyplot as plt

Cs = np.logspace(0.5, 2.5, num=100)
aloocvs = [compute_aloocv(C) for C in Cs]
loocvs = [compute_loocv(C) for C in Cs]
plt.plot(Cs, aloocvs, label='ALOOCV')
plt.plot(Cs, loocvs, label='LOOCV')
plt.xlabel('C')
plt.xscale('log')
plt.ylabel('Cross-Validation Error')
plt.legend()
plt.savefig('iris_cv.svg')

This results in a plot of ALOOCV and LOOCV against C, saved to iris_cv.svg.

To select a value of C, we could quickly test different values of C and pick the one with the best ALOOCV value. But we can do much better.

ALOOCV isn’t just efficient to compute for a hyperparameter; it’s also possible to efficiently compute the first and second derivatives of ALOOCV with respect to hyperparameters [3]. Thus, we can apply a second-order optimizer to very quickly dial into the exact value of C that optimizes ALOOCV.
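To show what a second-order optimizer does with those derivatives, here is a minimal sketch of Newton's method in one dimension. The curve below is a hypothetical stand-in written as a function of t = log10(C); it is not the real ALOOCV curve for any data set, and this is not bbai's optimizer.

```python
def newton_minimize(grad, hess, t0, tol=1e-10, max_iter=50):
    """Minimize a smooth 1-D function using its first and second derivatives."""
    t = t0
    for _ in range(max_iter):
        step = grad(t) / hess(t)  # assumes hess(t) > 0 near the minimum
        t -= step
        if abs(step) < tol:
            break
    return t

# Hypothetical cross-validation curve with its minimum at t = 1.0.
def cv(t):
    return (t - 1.0) ** 2 + 0.1 * (t - 1.0) ** 4 + 0.2

def cv_grad(t):
    return 2 * (t - 1.0) + 0.4 * (t - 1.0) ** 3

def cv_hess(t):
    return 2 + 1.2 * (t - 1.0) ** 2

t_opt = newton_minimize(cv_grad, cv_hess, t0=-0.3)
print(10 ** t_opt)  # the corresponding value of C
```

Because each step uses curvature as well as slope, the iteration typically needs only a handful of evaluations, compared with the 100 grid points used for the plot above.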

The bbai package can do all of this for us behind the scenes. Here’s how:

# Note: when we don't provide a value for C, bbai.glm.LogisticRegression
# will apply an optimizer to find the value of C with the best ALOOCV
model = bbai.glm.LogisticRegression(fit_intercept=False)
model.fit(X, y)
print("C_opt = ", model.C_)

This prints out:

C_opt =  67.38021801069182

In addition to hyperparameter optimization, ALOOCV can also tell us a lot about the training data set.

Identifying Outliers With ALOOCV

When we compute ALOOCV, we get approximate leave-one-out errors for each data point as a byproduct. A large approximate leave-one-out error suggests that the associated data point may be an outlier. Outliers can be worth drilling into: we might want to use a different model for them, or they might indicate an error in data collection.

Going back to the iris data set, let’s order the data points by their approximate leave-one-out error and plot out the individual errors. We’ll also plot out the leave-one-out errors so we can see how close they are.

def compute_loocvs(C):
    cvs = []
    for train_indexes, test_indexes in LeaveOneOut().split(X):
        X_train = X[train_indexes]
        y_train = y[train_indexes]
        X_test = X[test_indexes]
        y_test = y[test_indexes]
        model = bbai.glm.LogisticRegression(C=C, fit_intercept=False)
        model.fit(X_train, y_train)
        pred = model.predict_log_proba(X_test)
        cvs.append(-pred[0][y_test[0]])
    return cvs

n = len(y)
aloocvs = model.aloocvs_
loocvs = compute_loocvs(model.C_)
indexes = list(range(n))
indexes = sorted(indexes, key=lambda i: -aloocvs[i])
aloocvs = [aloocvs[i] for i in indexes]
loocvs = [loocvs[i] for i in indexes]
ix = list(range(n))
plt.plot(ix, aloocvs, marker='x', label='ALOO', linestyle='None')
plt.plot(ix, loocvs, marker='+', label='LOO', linestyle='None')
plt.ylabel('Leave-one-out Error')
plt.xlabel('Data Point')
plt.legend()
plt.savefig('iris_loo.svg')

This produces a plot of the sorted leave-one-out errors, saved to iris_loo.svg.

Looking at this graph, we could choose a cutoff point and select points to examine further.
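One way to pick such a cutoff, sketched below, is to flag points whose error exceeds the median by several median absolute deviations (MADs). The rule, the default of 5 MADs, and the toy errors are assumptions chosen for illustration; flag_outliers is a hypothetical helper, not part of bbai.

```python
import numpy as np

def flag_outliers(errors, num_mads=5.0):
    """Indexes of points with error above median + num_mads * MAD, largest first."""
    errors = np.asarray(errors, dtype=float)
    median = np.median(errors)
    mad = np.median(np.abs(errors - median))
    cutoff = median + num_mads * mad
    flagged = np.flatnonzero(errors > cutoff)
    return flagged[np.argsort(-errors[flagged])], cutoff

# Made-up per-point errors for illustration; with the model fitted above
# you would pass model.aloocvs_ instead.
toy_errors = [0.02, 0.01, 0.03, 1.40, 0.02, 0.65, 0.01, 0.04]
flagged, cutoff = flag_outliers(toy_errors)
print(flagged, cutoff)
```

The right cutoff depends on the data set, so it is worth checking the flagged points against the plot before acting on them.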

Complete code from this blog can be found at https://github.com/rnburn/bbai/blob/master/example/02-iris.py.

References

[1]: For a comparison of LOOCV to other forms of k-fold cross-validation, see A scalable estimate of the out-of-sample prediction error via approximate leave-one-out cross-validation

[2]: To get more details on the math behind ALOOCV, see https://buildingblock.ai/logistic-regression-guide#approximate-leave-one-out-cross-validation

[3]: For details on computing the derivatives of ALOOCV and optimizing it, see Optimizing Approximate-leave-one-out Cross-validation