This is Part 2 of a three-part series about creating visualizations for dissecting data and models. Part 1 can be found here and code, including a Jupyter Notebook with the visualizations in this post, is on GitHub.
When you train a classifier on a dataset, it uses a specific algorithm to partition the feature space into regions, one for each class. The surfaces where the prediction switches from one class to another are called decision boundaries (for linear models, these are hyperplanes). On one side of a decision boundary, a data point is more likely to be assigned to one class; on the other side of the boundary, it’s more likely to be assigned to another.
Boundaries are fuzzy, but they illustrate where key ‘decision points’ are made by the model.
This visualization compares 10 algorithms on three two-dimensional datasets with different intrinsic properties. Taken from scikit-learn.org.
Importantly, decision boundaries are not confined to just the data points you provided — they span through the entire feature space you trained on. The model can predict a value for any possible combination of inputs in your feature space. If the data you train on is not ‘diverse’, the overall topology of the model (decision boundaries and classification regions) will generalize poorly to new instances.
This is important to know for models you throw into production, or try to reuse on orthogonal datasets. There is nothing inherent to a machine learning model that will warn you if the model is not appropriate for another dataset. There is nothing that will tell you ‘this data point is very different from the ones I learned on.’
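To make this concrete, here is a minimal sketch (not taken from the notebook) assuming a K-nearest neighbor classifier trained on the first two principal components of iris. The far-away point is a made-up value chosen only to sit well outside the training data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Project iris onto its first two principal components.
X, y = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_2d, y)

# A point far outside anything seen in training (hypothetical values).
far_away = np.array([[100.0, -100.0]])

# The classifier still returns one of the known classes, with no warning.
label = clf.predict(far_away)[0]
proba = clf.predict_proba(far_away)[0]
print("predicted class:", label)
print("class probabilities:", proba)
```

The model answers as if the point were perfectly ordinary. Any check for "is this point like my training data?" has to be added around the model.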
Understanding the limitations of existing models and the decision boundaries they learned is helpful for repurposing and reapplication, especially in instances where retraining or transfer learning are not possible.
Training a classifier requires data and an algorithm. Choosing an algorithm is an iterative and often experimental process. Rarely am I able to correctly select the appropriate algorithm that will perform best on a particular dataset on my first try.
So why is that? Why is there no ‘one model to rule them all’? Can’t we just throw a neural net at every problem?
The “No Free Lunch Theorem” states that, averaged over all possible problems, no search or optimization algorithm outperforms any other: an algorithm that excels on one class of problems pays for it on others. In other words, there is no universally best algorithm across all data. Selecting the right approach takes intuition, an understanding of the data and goals of the analysis, practice, and time.
Average ranking of supervised learning algorithms (scikit-learn) over 165 supervised classification datasets from the Penn Machine Learning Benchmark (PMLB). Gradient boosting is excellent, but is outperformed on many datasets. This analysis also does not take the generalizability or interpretability of the model into account. Image taken from the excellent Olson, et al., available on PubMed.
Examining decision boundaries is a great way to learn how the training data you select affects performance and your model’s ability to generalize, especially if you’re someone who learns tactilely. Visualizing decision boundaries can illustrate how sensitive models are to each dataset. It’s also a great way to build intuition for how specific algorithms work and where they fall short on specific datasets.
In Part 1, I discussed using Bokeh to generate interactive PCA reports. Here I’ll discuss how to use Bokeh to generate decision boundary plots.
Training a K-nearest neighbor classifier (K=3) on the first two principal components of the iris dataset. Each class has its own color, with corresponding decision boundaries. Data points with outlines indicate test data with the model’s prediction as the outline. In the case where the outline is a different color, the model misclassified that datapoint.
My goals for this visualization tool were three-fold. Given a model and a dataset:
In addition, the code should be as generalizable as possible: it should accept any scikit-learn classifier and any dataset with any number of classes. Note that a limitation of the approach is that it currently only works on two-dimensional data, so transforming the data (with e.g. PCA) is necessary. In the future, I may explore visualizing multi-dimensional decision boundaries in two/three human-interpretable dimensions.
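As a rough sketch of what "accept any scikit-learn classifier" can look like, the helper below reduces a dataset to two principal components and fits whichever classifier it is handed. The function name `fit_on_2d` and the choice of classifiers are my own assumptions for illustration, not the notebook's code.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def fit_on_2d(clf, X, y):
    """Reduce X to two principal components and fit clf on them.

    Hypothetical helper: returns the fitted PCA, the 2D data,
    and the fitted classifier.
    """
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)
    clf.fit(X_2d, y)
    return pca, X_2d, clf


X, y = load_iris(return_X_y=True)  # 4 features, 3 classes

# Any estimator with fit/predict can be passed in unchanged.
for clf in (LogisticRegression(max_iter=1000),
            DecisionTreeClassifier(random_state=0)):
    pca, X_2d, fitted = fit_on_2d(clf, X, y)
    print(type(fitted).__name__, X_2d.shape, fitted.score(X_2d, y))
```

Relying only on `fit` and `predict` keeps the plotting code independent of the algorithm. New data has to go through the same fitted PCA before it is passed to the classifier.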
Below, I’ll walk through key components of the visualization.
A major component of the tool is automatically generating the ‘mesh grid’. This is a set of coordinates upon which the model will make predictions, which are then visualized to reveal the decision boundaries.
I designed the mesh grid algorithm such that it is tuned by the data itself. Two key aspects of the mesh grid are how far apart the grid points are (step, or step size) and the window in which you want to visualize predictions (bound).
I use a bound and steps to define a new 2D array in the existing coordinate system.
To generate bound and step values, I used the average of the data’s ranges (maximum minus minimum) along each axis of the 2D dataset:
import numpy as np

# set bound as a % of average of ranges for x and y
bound = 0.1*np.average(np.ptp(matrix_2D, axis=0))
# set step size as a % of the average of ranges for x and y
step = 0.05*np.average(np.ptp(matrix_2D, axis=0))
matrix_2D is the 2D dataset. The numpy.ptp function calculates the peak-to-peak (maximum minus minimum) range of your data along a specified axis. The step size defines the resolution of the mesh grid. The smaller the step size between grid points, the higher the resolution.
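Here is the arithmetic worked through on a tiny, made-up dataset, using the same 10% and 5% factors as above.

```python
import numpy as np

# Toy 2D dataset (made-up values): x spans 0..10, y spans 0..2.
matrix_2D = np.array([[0.0, 0.0],
                      [4.0, 1.0],
                      [10.0, 2.0]])

ranges = np.ptp(matrix_2D, axis=0)   # peak-to-peak range per axis: 10 and 2
avg_range = np.average(ranges)       # (10 + 2) / 2 = 6

bound = 0.1 * avg_range              # padding around the data, about 0.6
step = 0.05 * avg_range              # spacing between grid points, about 0.3

print(ranges, avg_range, bound, step)
```

Because the same step is used on both axes, an axis with a much smaller range ends up with fewer grid points than the other.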
With bound, I then generate the window for populating the mesh:
# get boundaries
x_min = matrix_2D[:,0].min() - bound
x_max = matrix_2D[:,0].max() + bound
y_min = matrix_2D[:,1].min() - bound
y_max = matrix_2D[:,1].max() + bound
The final mesh is generated by placing a grid point at every step within the window. Here, the numpy.meshgrid() function comes in handy. It returns one coordinate array per axis:
xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))
Finally, you can simply use a trained classifier to predict values across the entire mesh:
# pair up the x and y coordinates, predict, then restore the grid shape
predictions = trained_clf.predict(np.c_[xx.ravel(), yy.ravel()])
predictions = predictions.reshape(xx.shape)
Dataframe of a meshgrid with predictions.
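Putting the mesh steps together, a self-contained sketch might look like the following. The function name `predict_on_mesh` and its `bound_frac`/`step_frac` parameters are my own labels for illustration, and the K-nearest neighbor model on iris simply mirrors the earlier figure.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier


def predict_on_mesh(trained_clf, matrix_2D, bound_frac=0.1, step_frac=0.05):
    """Return (xx, yy, predictions) for a mesh grid scaled to the data."""
    avg_range = np.average(np.ptp(matrix_2D, axis=0))
    bound = bound_frac * avg_range
    step = step_frac * avg_range

    # Window: the data's extent padded by `bound` on every side.
    x_min, y_min = matrix_2D.min(axis=0) - bound
    x_max, y_max = matrix_2D.max(axis=0) + bound

    # meshgrid returns one coordinate array per axis.
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))

    # Flatten to an (n_points, 2) array, predict, then restore the grid shape.
    grid_points = np.c_[xx.ravel(), yy.ravel()]
    predictions = trained_clf.predict(grid_points).reshape(xx.shape)
    return xx, yy, predictions


X, y = load_iris(return_X_y=True)
matrix_2D = PCA(n_components=2).fit_transform(X)
trained_clf = KNeighborsClassifier(n_neighbors=3).fit(matrix_2D, y)

xx, yy, predictions = predict_on_mesh(trained_clf, matrix_2D)
print(xx.shape, predictions.shape, np.unique(predictions))
```

The `predictions` array has the same shape as `xx` and `yy`, so each predicted class lines up with its grid coordinate when the mesh is drawn.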
The other fun part of developing the tool was figuring out how to visualize data points that were used for training and testing. I iterated over a few design choices and settled on a final one:
I’m pretty happy with the results, which you can see below. The final UI makes it easy to see which data were used for train/test and immediately see where the model does a poorer job.
Decision boundary for the 2D PCA of the iris dataset, zoomed in at the virginica (yellow)/versicolor (green) interface. The HoverTool is shown over a datapoint in the ‘test data.’ The ‘truth’ for this point is virginica but the predicted value is versicolor.
There’s more detail on how to implement this UI element in the Jupyter Notebook, including how to add additional touches like HoverTool labels. Feel free to have a look and play with it!
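As a plotting-library-agnostic sketch of the outline idea, the snippet below computes a fill color (the true class) and an outline color (the predicted class) for each test point. The palette, the split, and the variable names are assumptions of mine, and the notebook's actual implementation may differ. Arrays like these could then be supplied as the fill and line colors of the plotted points.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical palette: one color per iris class.
PALETTE = np.array(["purple", "green", "gold"])

X, y = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_2d, y, test_size=0.3, random_state=0, stratify=y)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Training points: the fill shows the true class (drawn without an outline).
train_fill = PALETTE[y_train]

# Test points: the fill shows the true class, the outline shows the prediction.
test_fill = PALETTE[y_test]
test_line = PALETTE[y_pred]

# A test point whose outline differs from its fill was misclassified.
misclassified = test_fill != test_line
print("misclassified test points:", int(misclassified.sum()))
```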
Interpolation is estimating properties of data within the range covered by a given dataset. Extrapolation is making predictions beyond that range.
An example of interpolation is estimating a missing feature in your dataset for a particular instance. Say you have developed a new material, and you’re able to measure its thermal properties but not certain conductive ones. Using a model, the missing property values could be estimated based on the other factors. But what if the material is very different from the data you have in hand? You have to trust that the model is able to generalize well and can extrapolate the missing value.
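One crude way to tell whether a new instance would require extrapolation is to check whether any of its features falls outside the range seen in training. This is only a sketch with made-up numbers, and the helper name `outside_training_range` is hypothetical.

```python
import numpy as np


def outside_training_range(X_train, X_new):
    """Flag rows of X_new with any feature outside the training min/max."""
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    return np.any((X_new < lo) | (X_new > hi), axis=1)


# Made-up training data: two features per instance.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

X_new = np.array([[2.5, 15.0],    # inside the range on both features
                  [2.5, 45.0]])   # second feature exceeds the training max

flags = outside_training_range(X_train, X_new)
print(flags)
```

This is a coarse test. A point can sit inside the bounding box and still be far from every training instance, so passing the check does not guarantee the model has seen anything similar.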
Application of an existing model to new data points is very dependent on the decision boundaries of the model. Let’s look at a simple example.
Two Random Forest models trained on the same data, including the same training and test split. On the right, the model has a higher accuracy, but also a strange decision boundary.
Above, you can see two Random Forest models trained on the same data, including the same training and test split. On the right, the model has a higher accuracy, but also a strange decision boundary. At PC1=2, as you move from PC2=-1 to PC2=1, the model will switch from versicolor to virginica to versicolor to virginica. Importantly, this strange decision boundary is close to actual instances of the data. Though the model on the right is more accurate, this boundary means that nearly identical data points can receive different predictions, a sign that the model may be overfit.
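A quick way to quantify this kind of behavior is to walk along a line in feature space and count how often the predicted class changes. The sketch below assumes Random Forests trained on a 2D PCA of iris with two arbitrary seeds. The helper `count_class_switches` is hypothetical, and since the models and PCA orientation here may not match the ones in the figure, the counts will not necessarily match either.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier


def count_class_switches(clf, pc1, pc2_start, pc2_stop, n_points=200):
    """Count how often the predicted class changes along a vertical line."""
    pc2 = np.linspace(pc2_start, pc2_stop, n_points)
    line = np.column_stack([np.full(n_points, pc1), pc2])
    labels = clf.predict(line)
    # Compare each prediction with its neighbor along the line.
    return int(np.count_nonzero(labels[1:] != labels[:-1]))


X, y = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)

# Same data, different random seeds: the boundaries can differ.
for seed in (0, 1):
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_2d, y)
    switches = count_class_switches(rf, pc1=2.0, pc2_start=-1.0, pc2_stop=1.0)
    print(f"seed={seed}: {switches} class switches along PC1=2")
```

Several switches over a short distance, close to real data points, are a hint that the boundary is fitting noise and not structure.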
I want to finish this post by discussing data sets requiring non-linear decision boundaries, and fold in a little about how dimensionality reduction techniques can be used to expand which models can be used.
The Swiss Roll is often used to demonstrate manifold learning techniques and the limitations of common clustering methods (such as agglomerative clustering or k-nearest neighbors) on complex manifolds.
A 2D swiss roll with two classes (purple, red). Try to cleanly separate the purple and the red bits with a linear function — you can’t! This is a simple example where non-linear functions will be valuable.
As one would expect, a linear function does a very poor job in defining a decision boundary that properly splits these data. Here’s Logistic Regression in action:
It’s doing its best.
Meanwhile, Random Forest has no problem on these data. It essentially carves a box around the center (though it’s not learning anything about the ‘curvature’ of the data).
Random Forest ‘decides’ that anything within a specific set of bounds is green.
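To reproduce the flavor of this comparison, here is a sketch that builds a two-class 2D roll and scores both models on a held-out split. The dataset construction is an assumption on my part (the notebook may build it differently): I drop one axis from scikit-learn's 3D Swiss roll and label the inner half of the spiral as one class and the outer half as the other.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Build a 2D roll by keeping only the two axes that form the spiral.
X_3d, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=0)
X_roll = X_3d[:, [0, 2]]

# Assumed labeling: inner half of the spiral vs. outer half.
y_roll = (t > np.median(t)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_roll, y_roll, test_size=0.3, random_state=0, stratify=y_roll)

lr = LogisticRegression().fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

lr_acc = lr.score(X_test, y_test)
rf_acc = rf.score(X_test, y_test)
print(f"logistic regression: {lr_acc:.2f}, random forest: {rf_acc:.2f}")
```

A single straight line cannot isolate the center of the roll from the ring around it, while the forest's axis-aligned splits can box it in.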
Though it’s entirely inappropriate for these data, let’s look at how a multi-layer perceptron (or feed-forward neural network) behaves on these data:
Results from a 16 x 16 perceptron.
As you can see, we approach 100% accuracy, and the decision boundary in the center overlaps strongly with that of Random Forest. However, you can see the upper left corner is all green. The ‘green’ decision region will extend out to infinity in that direction.
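You can probe this "out to infinity" behavior directly by asking the network about points far outside the training data. In this sketch I read "16 x 16" as two hidden layers of 16 units each, add input scaling, and reuse an assumed two-class roll built from scikit-learn's Swiss roll. All of these are my assumptions, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed two-class 2D roll: inner half of the spiral vs. outer half.
X_3d, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=0)
X_roll = X_3d[:, [0, 2]]
y_roll = (t > np.median(t)).astype(int)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0),
).fit(X_roll, y_roll)

# Probe points far outside the training data (the roll's radius is under 15).
far_points = np.array([[-1e3, 1e3], [1e3, 1e3], [1e3, -1e3], [-1e3, -1e3]])
far_labels = mlp.predict(far_points)
far_proba = mlp.predict_proba(far_points).max(axis=1)
print("labels far from the data:", far_labels)
print("highest class probability:", far_proba)
```

Every far-away point still gets a class, and nothing in the output marks it as an extrapolation. Whichever region happens to extend past the edge of the data decides the answer.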
You may be asking, “Why do we care about any of this? My data is not a swiss roll.” Here’s what I think: