This is Part 2 of a three-part series about creating visualizations for dissecting data and models. Part 1 can be found here and code, including a Jupyter Notebook with the visualizations in this post, is on GitHub.
When you train a classifier on a dataset, it uses a specific algorithm to define a set of surfaces (hyperplanes, in the linear case) that separate the data points into classes. The regions where the model switches from predicting one class to another are called decision boundaries. On one side of a decision boundary, a data point is more likely to be classified as one class — on the other side of the boundary, it’s more likely to be classified as another.
Boundaries are fuzzy, but they illustrate where key ‘decision points’ are made by the model.
Importantly, decision boundaries are not confined to just the data points you provided — they span through the entire feature space you trained on. The model can predict a value for any possible combination of inputs in your feature space. If the data you train on is not ‘diverse’, the overall topology of the model (decision boundaries and classification regions) will generalize poorly to new instances.
This is important to know for models you throw into production, or try to reuse on orthogonal datasets. There is nothing inherent to a machine learning model that will warn you if the model is not appropriate for another dataset. There is nothing that will tell you ‘this data point is very different from the ones I learned on.’
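To make this concrete, here’s a minimal sketch — the dataset and classifier are my own illustrative choices, not from this series — showing that a scikit-learn model will confidently classify a point far outside its training distribution without any warning:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # training data clustered near the origin
y = (X[:, 0] > 0).astype(int)        # class depends on the sign of the first feature

clf = RandomForestClassifier(random_state=0).fit(X, y)

# A point nowhere near anything the model has seen
far_away = np.array([[100.0, 100.0]])
print(clf.predict(far_away))         # a class label, no warning
print(clf.predict_proba(far_away))   # typically near-certain probabilities
```

Nothing in the API flags the prediction as out-of-distribution; it is up to you to check.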
Understanding the limitations of existing models and the decision boundaries they learned is helpful for repurposing and reapplication, especially in instances where retraining or transfer learning are not possible.
Training a classifier requires data and an algorithm. Choosing an algorithm is an iterative and often experimental process. Rarely am I able to correctly select the appropriate algorithm that will perform best on a particular dataset on my first try.
So why is that? Why is there no ‘one model to rule them all’? Can’t we just throw a neural net at every problem?
The “No Free Lunch Theorem” states that search and optimization algorithms with excellent performance for one class of problems will not excel at others. In other words, there is no universally-useful algorithm across all data. Selecting the right approach takes intuition, an understanding of the data and goals of the analysis, practice, and time.
Examining decision boundaries is a great way to learn how the training data you select affects performance and your model’s ability to generalize — especially if you’re someone who learns tactilely. Visualizing decision boundaries can illustrate how sensitive models are to each dataset. It’s also a great way to intuitively understand how specific algorithms work, and their limitations on specific datasets.
Decision Boundary Plots in Bokeh
In Part 1, I discussed using Bokeh to generate interactive PCA reports. Here I’ll discuss how to use Bokeh to generate decision boundary plots.
My goals for this visualization tool were three-fold. Given a model and a dataset:
- I want to see which data points were used for training, to understand if the distribution was appropriate
- I want to see which data points were used for testing, and ‘where’ the classifier is challenged with accurate predictions
- I want to see the decision boundaries for each class, to understand if/how the model will be useful for prediction for future, unseen data points
In addition, the code should be as generalizable as possible — it should accept any scikit-learn classifier and any dataset with any number of classes. Note that one limitation of the approach is that it currently only works on two-dimensional data, so transforming the data (with, e.g., PCA) is necessary. In the future, I may explore visualizing multi-dimensional decision boundaries in two or three human-interpretable dimensions.
Below, I’ll walk through key components of the visualization.
A major component of the tool is automatically generating the ‘mesh grid’. This is a set of coordinates upon which the model will make predictions, which are then visualized to reveal the decision boundaries.
I designed the mesh grid algorithm so that it is tuned by the data itself. Two key aspects of the mesh grid are how far apart the grid points are (step, or step size) and the window in which you want to visualize predictions (bound).
To generate bound and step values, I used the average distance between data points within each axis of the 2D dataset:
```python
# set bound as a % of average of ranges for x and y
bound = 0.1 * np.average(np.ptp(matrix_2D, axis=0))

# set step size as a % of the average of ranges for x and y
step = 0.05 * np.average(np.ptp(matrix_2D, axis=0))
```
matrix_2D is the 2D dataset. The numpy.ptp function calculates the peak-to-peak (max minus min) distance in your data along a specified axis. The step size defines the resolution of the mesh grid: the smaller the step size, the finer the resolution.
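If numpy.ptp is unfamiliar, a tiny worked example (with made-up numbers) shows how bound and step fall out of the data’s per-axis ranges:

```python
import numpy as np

# A made-up 2D dataset: np.ptp gives the peak-to-peak (max - min)
# range along each axis
matrix_2D = np.array([[0.0, 1.0],
                      [4.0, 3.0],
                      [2.0, 9.0]])

ranges = np.ptp(matrix_2D, axis=0)  # column ranges: [4.0, 8.0]
bound = 0.1 * np.average(ranges)    # 0.1 * 6.0 = 0.6
step = 0.05 * np.average(ranges)    # 0.05 * 6.0 = 0.3
print(ranges, bound, step)
```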
With bound, I then generate the window for populating the mesh:
```python
# get boundaries
x_min = matrix_2D[:, 0].min() - bound
x_max = matrix_2D[:, 0].max() + bound
y_min = matrix_2D[:, 1].min() - bound
y_max = matrix_2D[:, 1].max() + bound
```
The final mesh is generated by creating a coordinate at each step within the bounds. Here, the numpy.meshgrid() function comes in handy; it returns one coordinate array per axis, which I unpack into xx and yy:

```python
xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                     np.arange(y_min, y_max, step))
```

Finally, you can simply use a trained classifier to predict values across the entire mesh:

```python
predictions = trained_clf.predict(np.c_[xx.ravel(), yy.ravel()])
predictions = predictions.reshape(xx.shape)
```
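Putting the pieces together, here’s a minimal end-to-end sketch. The dataset (iris projected to 2D with PCA) and the classifier are illustrative stand-ins; note that np.meshgrid() returns one coordinate array per axis, so it’s unpacked into xx and yy:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Project iris to two dimensions so the mesh-grid approach applies
X, y = load_iris(return_X_y=True)
matrix_2D = PCA(n_components=2).fit_transform(X)

# Data-driven bound and step sizes
bound = 0.1 * np.average(np.ptp(matrix_2D, axis=0))
step = 0.05 * np.average(np.ptp(matrix_2D, axis=0))

x_min = matrix_2D[:, 0].min() - bound
x_max = matrix_2D[:, 0].max() + bound
y_min = matrix_2D[:, 1].min() - bound
y_max = matrix_2D[:, 1].max() + bound

xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                     np.arange(y_min, y_max, step))

trained_clf = LogisticRegression(max_iter=1000).fit(matrix_2D, y)

# One prediction per mesh coordinate, reshaped back onto the grid
predictions = trained_clf.predict(np.c_[xx.ravel(), yy.ravel()])
predictions = predictions.reshape(xx.shape)
print(predictions.shape)
```

Each cell of predictions now holds the class the model assigns at that grid coordinate, which is exactly what gets colored to reveal the decision boundaries.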
Training and Test Data
The other fun part of developing the tool was figuring out how to visualize data points that were used for training and testing. I iterated over a few design choices and settled on a final one:
- Training data and test data will be colored by their class
- Test data will have an additional element: a bold outline indicating the predicted class for that data point
- A HoverTool will give the true class and the predicted one
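As a rough sketch of those three choices in Bokeh — the coordinates, column names, colors, and sizes below are my own placeholders, not necessarily what the Notebook uses:

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Hypothetical data: two training points and one test point whose
# true class (navy) differs from its predicted class (firebrick)
train = ColumnDataSource(data=dict(
    x=[0.0, 1.0], y=[0.0, 1.0], color=["navy", "firebrick"]))
test = ColumnDataSource(data=dict(
    x=[0.5], y=[0.5],
    fill=["navy"],            # fill colored by the true class
    outline=["firebrick"],    # bold outline colored by the predicted class
    true_class=["setosa"], pred_class=["versicolor"]))

p = figure(width=400, height=400)

# Training data: colored by class
p.scatter("x", "y", color="color", size=8, source=train)

# Test data: class coloring plus a bold outline for the prediction
r = p.scatter("x", "y", fill_color="fill", line_color="outline",
              line_width=3, size=12, source=test)

# HoverTool reporting the true and predicted class for test points only
p.add_tools(HoverTool(renderers=[r],
                      tooltips=[("true", "@true_class"),
                                ("predicted", "@pred_class")]))
```

Restricting the HoverTool to the test-point renderer keeps the tooltip from firing on training points, which have no prediction to report.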
I’m pretty happy with the results, which you can see below. The final UI makes it easy to see which data were used for train/test and immediately see where the model does a poorer job.
There’s more detail on how to implement this UI element in the Jupyter Notebook, including how to add additional touches like HoverTool labels. Feel free to have a look and play with it!
Interpolation and Extrapolation
Interpolation is estimating properties of data within the boundaries of a given dataset. Extrapolation is making predictions beyond those boundaries.
An example of interpolation is estimating a missing feature in your dataset for a particular instance. Say you have developed a new material, and you’re able to measure its thermal properties but not certain conductive ones. Using a model, the missing property values could be estimated from the measured ones. But what if the material is very different from the data you have in hand? You have to trust that the model generalizes well and can extrapolate the missing value.
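A quick sketch of why extrapolation is risky, using a hypothetical property y = x² and a Random Forest (my own choice for illustration). Tree-based models can only return values they saw during training, so predictions flatten out beyond the training range:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = X_train[:, 0] ** 2             # hypothetical property: y = x^2

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Interpolation: x = 5 lies inside the training range, so the
# prediction lands close to the true value of 25
inside = model.predict([[5.0]])[0]

# Extrapolation: x = 20 lies far outside the training range; the trees
# can only return leaf values learned from training data (all below
# 100), so the prediction flattens out instead of reaching the true 400
outside = model.predict([[20.0]])[0]
print(inside, outside)
```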
Application of an existing model to new data points is very dependent on the decision boundaries of the model. Let’s look at a simple example.
Above, you can see two Random Forest models trained on the same data, including the same training and test split. On the right, the model has a higher accuracy, but also a strange decision boundary. At PC1=2, as you move from PC2=-1 to PC2=1, the model will switch from versicolor to virginica to versicolor to virginica. Importantly, this strange decision boundary is close to actual instances of the data. Though the model on the right is more accurate, the decision boundary will result in very different predictions and the model may be overfit.
I want to finish this post by discussing data sets requiring non-linear decision boundaries, and fold in a little about how dimensionality reduction techniques can be used to expand which models can be used.
The Swiss Roll is often used to demonstrate manifold learning techniques and the limitations of common clustering methods (such as agglomerative clustering or k-means) on complex manifolds.
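For reference, the Swiss Roll can be generated with scikit-learn’s make_swiss_roll; the two-class labeling below is my own illustrative choice (the Notebook may define classes differently):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# Generate the classic Swiss Roll (n x 3) and flatten it to 2D by
# dropping the "depth" axis, which leaves the spiral cross-section
X, t = make_swiss_roll(n_samples=1000, noise=0.5, random_state=0)
roll_2D = X[:, [0, 2]]

# Hypothetical two-class labeling along the manifold parameter t
y = (t > np.median(t)).astype(int)
print(roll_2D.shape)
```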
As one would expect, a linear function does a very poor job of defining a decision boundary that properly splits these data. Here’s Logistic Regression in action:
Meanwhile, Random Forest has no problem on these data. It essentially carves a box around the center (though it’s not learning anything about the ‘curvature’ of the data).
Though it’s entirely inappropriate for these data, let’s look at how a multi-layer perceptron (a feed-forward neural network) behaves:
As you can see, we approach 100% accuracy, and the decision boundary in the center strongly overlaps with the Random Forest’s. However, notice that the upper-left corner is all green: that ‘green’ decision region extends out to infinity in that direction.
You may be asking, “Why do we care about any of this? My data is not a swiss roll.” Here’s what I think:
- ‘Accuracy’ alone is a poor metric of a model’s fitness for every use case. Overfitting is a common challenge, and understanding it takes practice and intuition. Going beyond single numbers to build ‘model sense’ and ‘data sense’ helps me gain better intuition about data.
- Models get repurposed often. The generalizability of a model matters, particularly when deploying in dynamic systems (e.g. an advertising platform with new products on the market, a healthcare system with new treatments coming out each month).
- Robust models that allow you to simulate state-changes in a system fascinate me (what if the patient received treatment Y instead of X? what if they had a mutation in this gene versus that?) but are tricky to achieve in practice — decision boundaries strongly affect ‘what if’ predictions.