Mathematical Engineer | Building better models:
Leave-one-out Cross-validation (LOOCV) is one of the most accurate ways to estimate how well a model will perform on out-of-sample data. Unfortunately, it can be expensive, requiring a separate model to be fit for each point in the training data set. For the specialized cases of ridge regression, logistic regression, Poisson regression, and other generalized linear models, though, Approximate Leave-one-out Cross-validation (ALOOCV) gives us a much more efficient estimate of out-of-sample error that’s nearly as good as LOOCV.
In this post, I’ll cover:
First, a refresher on LOOCV.
Suppose we’re fitting a model to a data set of n feature vectors:
For each data entry i we form a new data set:
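In standard notation (an assumption here, since the formulas themselves are not reproduced above), with training pairs $(x_i, y_i)$, a loss function $\ell$, and $f_{-i}$ denoting the model fit with entry $i$ removed, the procedure can be sketched as:

```latex
\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}, \qquad
\mathcal{D}_{-i} = \mathcal{D} \setminus \{(x_i, y_i)\}, \qquad
\mathrm{LOOCV} = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(y_i, f_{-i}(x_i)\bigr)
```

Each $f_{-i}$ requires its own fit, which is where the cost of LOOCV comes from.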
Enter Approximate Leave-one-out Cross-validation (ALOOCV).
I won’t give a detailed description of the math behind ALOOCV (check the references if you want that) [2], but here’s a brief description:
For the specialized case of generalized linear models, we start by fitting a model to the full training data set, which gives us weights that optimize the model’s cost function. We can then use properties of the cost function optimum to accurately and efficiently estimate the target value the model would predict for a data entry if that entry were removed from the training data set.
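To make the idea concrete, here is a minimal sketch for ridge regression, the one case where the leave-one-out shortcut is exact rather than approximate. It is not how bbai computes ALOOCV; the synthetic data, the penalty `lam`, and the helper `ridge_fit` are all assumptions made up for illustration. The shortcut divides each full-fit residual by one minus the corresponding diagonal entry of the hat matrix.

```python
import numpy as np

# Synthetic data, purely illustrative
rng = np.random.default_rng(0)
n, p, lam = 50, 3, 1.0
X_toy = rng.normal(size=(n, p))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    """Ridge weights: solve (X^T X + lam * I) w = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Shortcut: a single fit on the full data set
w = ridge_fit(X_toy, y_toy, lam)
hat = X_toy @ np.linalg.solve(X_toy.T @ X_toy + lam * np.eye(p), X_toy.T)
h = np.diag(hat)                               # leverage of each data point
loo_shortcut = (y_toy - X_toy @ w) / (1.0 - h)

# Brute force: refit n times, leaving one point out each time
loo_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    w_i = ridge_fit(X_toy[keep], y_toy[keep], lam)
    loo_brute[i] = y_toy[i] - X_toy[i] @ w_i

print(np.allclose(loo_shortcut, loo_brute))
```

For logistic regression and other generalized linear models, no exact formula of this kind exists, which is why the leave-one-out predictions are approximated instead.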
Let’s see how we might use ALOOCV.
We’ll look at the classic Iris data set. If you’re not familiar with the data set, the task is to predict which of three species of iris a plant is based on the characteristics of the flower. In our example, we’ll use a multinomial logistic regression model to predict the iris species.
Whenever we fit a logistic regression model, we have a regularization parameter C that we need to tune.
C acts as a dial that controls the complexity of the model: If we set C too low, our model won’t take full advantage of the training data; but if we set C too high, it will overfit the training data and perform poorly on out-of-sample data.
Using ALOOCV, we can estimate how well logistic regression will perform for any given value of C.
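As a rough illustration of C acting as a dial, the sketch below fits scikit-learn's LogisticRegression (not bbai) at three values of C and prints the size of the fitted coefficients. It assumes C plays the same role in both libraries, namely that a larger C means weaker regularization, which matches the description above. The names `X_demo`, `y_demo`, and `coef_norms` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = []
for C in [0.01, 1.0, 100.0]:
    # Larger C penalizes the weights less, so they are free to grow
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    coef_norms.append(float(np.linalg.norm(clf.coef_)))
    print(f"C = {C:g}: coefficient norm = {coef_norms[-1]:.2f}")
```

Small weights correspond to a simpler, more constrained model; large weights let the model follow the training data more closely.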
Let’s plot out ALOOCV across a range of different C values. The Iris data set is small enough that it’s possible to compute LOOCV by brute force, so we’ll plot that out also so that we can see how accurate ALOOCV is.
To compute ALOOCV, we use the Python package bbai, which can be installed using pip:
pip install bbai
The Iris data set already comes packaged with sklearn. We can load and normalize the data set with this snippet of code:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X = np.hstack((X, np.ones((X.shape[0], 1)))) # add a bias column
Here’s how we can compute ALOOCV given a value of C:
import bbai.glm
def compute_aloocv(C):
    # The bias column is already part of X, so no separate intercept is fit
    model = bbai.glm.LogisticRegression(C=C, fit_intercept=False)
    model.fit(X, y)
    return model.aloocv_
Note that for logistic regression, cross-validation uses the negative log-likelihood as its error measurement.
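As a small sketch of what that error measurement looks like, the snippet below computes the mean negative log-likelihood from a table of predicted class probabilities. The probabilities and labels are made-up values for illustration only.

```python
import numpy as np

# Hypothetical predicted probabilities for 3 data points and 3 classes
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])
labels = np.array([0, 1, 2])  # true class of each data point

# Probability the model assigned to the true class of each point
picked = probs[np.arange(len(labels)), labels]

# Mean negative log-likelihood: lower is better, 0 means perfect confidence
nll = -np.mean(np.log(picked))
print(nll)
```

A confident wrong prediction is penalized heavily, since the log of a probability near zero is a large negative number.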
This is how we can compute LOOCV:
from sklearn.model_selection import LeaveOneOut
def compute_loocv(C):
    ll_sum = 0
    for train_indexes, test_indexes in LeaveOneOut().split(X):
        X_train = X[train_indexes]
        y_train = y[train_indexes]
        X_test = X[test_indexes]
        y_test = y[test_indexes]
        # Refit the model with a single data point held out
        model = bbai.glm.LogisticRegression(C=C, fit_intercept=False)
        model.fit(X_train, y_train)
        pred = model.predict_log_proba(X_test)
        # Log-probability assigned to the true class of the held-out point
        ll_sum += pred[0][y_test[0]]
    # Mean negative log-likelihood over all held-out points
    return -ll_sum / len(y)
And here’s how we can plot ALOOCV and LOOCV:
import matplotlib.pyplot as plt
Cs = np.logspace(0.5, 2.5, num=100)
aloocvs = [compute_aloocv(C) for C in Cs]
loocvs = [compute_loocv(C) for C in Cs]
plt.plot(Cs, aloocvs, label='ALOOCV')
plt.plot(Cs, loocvs, label='LOOCV')
plt.xlabel('C')
plt.ylabel('Cross-Validation Error')
plt.legend()
plt.show()
This results in
ALOOCV vs LOOCV for different regularization strengths
To select a value of C, we could quickly test different values of C and pick the one with the best ALOOCV value. But we can do much better.
ALOOCV isn’t just efficient to compute for a hyperparameter; it’s also possible to efficiently compute the first and second derivatives of ALOOCV with respect to hyperparameters [3]. Thus, we can apply a second-order optimizer to very quickly dial into the exact value of C that optimizes ALOOCV.
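To illustrate what a second-order optimizer does with first and second derivatives, here is a minimal sketch of Newton's method in one dimension. The objective is a toy stand-in, cosh(t - 1.5), chosen only because its derivatives are simple and its minimum is at t = 1.5; it is not the ALOOCV function, and `newton_minimize` is a hypothetical helper, not part of bbai.

```python
import math

def newton_minimize(grad, hess, t0, tol=1e-10, max_iter=50):
    """Minimize a smooth 1-D function given its first and second derivatives."""
    t = t0
    for _ in range(max_iter):
        step = grad(t) / hess(t)  # Newton step: first over second derivative
        t -= step
        if abs(step) < tol:
            break
    return t

# Toy objective f(t) = cosh(t - 1.5): f'(t) = sinh(t - 1.5), f''(t) = cosh(t - 1.5)
t_opt = newton_minimize(
    grad=lambda t: math.sinh(t - 1.5),
    hess=lambda t: math.cosh(t - 1.5),
    t0=0.0,
)
print(t_opt)
```

Because each step uses curvature as well as slope, only a handful of iterations are needed here, which is the appeal of having derivatives available.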
The bbai package can do all of this for us behind the scenes. Here’s how:
model = bbai.glm.LogisticRegression(fit_intercept=False)
# Note: when we don't provide a value for C, bbai.glm.LogisticRegression
# will apply an optimizer to find the value of C with the best ALOOCV
model.fit(X, y)
print("C_opt = ", model.C_)
This prints out
C_opt = 67.38021801069182
In addition to hyperparameter optimization, ALOOCV can also tell us a lot about the training data set.
When we compute ALOOCV, as a byproduct we have approximate leave-one-out errors for each data point. If an approximate leave-one-out error is large, it indicates that the associated data point is an outlier. Outliers can be worth drilling into: We might want to use a different model for them; or they might indicate an error in data collection.
Going back to the iris data set, let’s order the data points by their approximate leave-one-out error and plot out the individual errors. We’ll also plot out the leave-one-out errors so we can see how close they are.
def compute_loocvs(C):
    cvs = []
    for train_indexes, test_indexes in LeaveOneOut().split(X):
        X_train = X[train_indexes]
        y_train = y[train_indexes]
        X_test = X[test_indexes]
        y_test = y[test_indexes]
        model = bbai.glm.LogisticRegression(C=C, fit_intercept=False)
        model.fit(X_train, y_train)
        pred = model.predict_log_proba(X_test)
        # Negative log-likelihood of the held-out point's true class
        cvs.append(-pred[0][y_test[0]])
    return cvs
n = len(y)
aloocvs = model.aloocvs_
loocvs = compute_loocvs(model.C_)
# Order the data points from largest to smallest approximate error
indexes = list(range(n))
indexes = sorted(indexes, key=lambda i: -aloocvs[i])
aloocvs = [aloocvs[i] for i in indexes]
loocvs = [loocvs[i] for i in indexes]
ix = list(range(n))
plt.plot(ix, aloocvs, marker='x', label='ALOO', linestyle='None')
plt.plot(ix, loocvs, marker='+', label='LOO', linestyle='None')
plt.ylabel('Leave-one-out Error')
plt.xlabel('Data Point')
plt.legend()
plt.show()
This displays:
Leave-one-out errors for C = 67.38
Looking at this graph, we could choose a cutoff point and select points to examine further.
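Selecting points above a cutoff could look something like the sketch below. The error values and the cutoff of 0.5 are made up for illustration; in practice the array would hold the per-point approximate leave-one-out errors and the cutoff would be read off the graph.

```python
import numpy as np

# Hypothetical per-point leave-one-out errors
errors = np.array([0.02, 0.01, 1.35, 0.04, 0.60, 0.03])
cutoff = 0.5  # illustrative threshold

# Indexes of data points whose error exceeds the cutoff
flagged = np.flatnonzero(errors > cutoff)

# Put the worst offenders first
flagged = flagged[np.argsort(-errors[flagged])]
print(flagged)
```

The resulting indexes point back into the original data set, so the flagged rows can be pulled up and inspected directly.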
Complete code from this blog can be found at
[1]: For a comparison of LOOCV to other forms of k-fold cross-validation, see A scalable estimate of the out-of-sample prediction error via approximate leave-one-out cross-validation
[2]: To get more details on the math behind ALOOCV, see
[3]: For details on computing the derivatives of ALOOCV and optimizing it, see Optimizing Approximate-leave-one-out Cross-validation