Diving deep into Supervised Learning 🏊
It’s kNN time.
kNN stands for k-Nearest Neighbors, a supervised learning algorithm. It can be used for classification as well as regression problems. First, we are gonna say hello to kNN, but if you want, you can skip ahead to the code. GitHub Repository: Machine Learning with JS.
How does kNN Algorithm work?
kNN decides the class of a new data point by taking a majority vote among its k nearest neighbors: whichever class most of those neighbors belong to wins.
If, say, k = 11 and the nearest neighbors of a new data point break down as NY: 7, NJ: 0, IN: 4, then the class of the new data point will be NY.
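To make the voting concrete, here is a minimal hand-rolled sketch in plain JavaScript (not the ml-knn library we use later; the Euclidean distance and the sample data are my own choices for illustration):

```javascript
// Euclidean distance between two feature vectors.
const distance = (a, b) =>
  Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));

// Classify `point` by a majority vote among its k nearest neighbors.
function knnClassify(trainingSet, labels, point, k) {
  const votes = trainingSet
    .map((sample, i) => ({ dist: distance(sample, point), label: labels[i] }))
    .sort((a, b) => a.dist - b.dist)
    .slice(0, k) // keep only the k closest samples
    .reduce((tally, neighbor) => {
      tally[neighbor.label] = (tally[neighbor.label] || 0) + 1;
      return tally;
    }, {});
  // Return the label with the most votes.
  return Object.keys(votes).reduce((a, b) => (votes[a] >= votes[b] ? a : b));
}
```

For example, `knnClassify([[1, 1], [1, 2], [5, 5], [6, 5]], ['A', 'A', 'B', 'B'], [1.5, 1.5], 3)` returns 'A', since two of the three nearest neighbors belong to class A.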
Let’s say you work at a Post Office and your job is to organize and distribute letters among Postmen so as to minimize the number of trips to the different neighborhoods. And since we are just imagining stuff, we can assume that there are only seven different neighborhoods. This is a kind of classification problem. You need to divide the letters into classes, where classes here refer to Upper East Side, Downtown Manhattan, and so on.
If you love wasting time and resources, you might give one letter from every neighborhood to each Postman, and hope that they meet each other in the same neighborhood and discover your corrupt plan. That’s the worst kind of distribution you could achieve.
On the other hand, you could organize the letters based on which addresses are close to each other.
You might start with “If it’s within a three-block range, give it to the same Postman.” That number of nearest blocks is where k comes from. You can keep increasing the number of blocks until you hit an efficient distribution. That’s the most efficient value of k for your classification problem.
So, based on some parameter(s), like the address of the house here, you classified whether a letter belongs to Downtown Manhattan, Times Square, et cetera. (I am not good with neighborhood names, so bear with me.)
kNN in practice | Code
As we did in the last tutorial, we are going to use the ml-knn module from the ml.js family to train our k-Nearest Neighbors classifier. Every Machine Learning problem needs data, and we are gonna use the IRIS dataset in this tutorial.
The Iris dataset consists of the petal and sepal measurements of three different types of irises (Setosa, Versicolour, and Virginica), along with a field signifying the respective type of each sample.
Step 1. Install the libraries
$ yarn add ml-knn csvtojson prompt
Or if you like
$ npm install ml-knn csvtojson prompt
ml-knn: k Nearest Neighbors
csvtojson: Parse data
prompt: To allow user prompts for predictions
Step 2. Initialize the library and load the Data
The Iris dataset is provided by the University of California, Irvine and is available here. However, because of the way it’s organized, you are gonna have to copy the content in the browser (Select All | Copy) and paste it into a file named iris.csv. You can name it whatever you want, except that the extension must be .csv.
Now, initialize the library and load the data. I am assuming you already have an empty npm project set-up, but if you are not familiar with it, here’s a quick intro.
The header names are used for visualization and understanding; they will be removed later.
seperationSize is used to split the data into training and test datasets.
We imported the csvtojson package, and now we are going to use its fromFile method to load the data. (Since our data doesn’t have a header row, we are providing our own header names.)
We are pushing each row to the data variable, and when the process is done, we are setting seperationSize to 0.7 times the number of samples in our dataset. Note that if the training set is too small, the classifier may not perform as well as it would with a larger one.
Since our dataset is sorted with respect to types (console.log it to confirm), the shuffleArray function is used to, well, shuffle the dataset so that it can be split. (If you don’t shuffle, you might end up with a model that works fine for the first two classes but fails with the third.)
Here’s how it is defined. I got it from an answer over at Stack Overflow.
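A standard Fisher-Yates shuffle, along the lines of that answer:

```javascript
// In-place Fisher-Yates shuffle: walk the array backwards and swap each
// element with a randomly chosen earlier (or same) position.
function shuffleArray(array) {
  for (let i = array.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [array[i], array[j]] = [array[j], array[i]];
  }
  return array;
}
```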
Step 3. Dress Data (yet again)
Our data is organized as follows:
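Each row of iris.csv holds the four measurements followed by the type; the first few rows look like this:

```
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
```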
There are two things we need to do to our data before we serve it to the kNN classifier:
- Turn the string values into floats.
- Turn the type into numbered classes. (Computers like numbers, you know?)
If you are not familiar with Sets, they are just like their mathematical counterparts, as in they can’t have duplicate elements, and their elements do not have an index. (As opposed to Arrays.)
And they can be easily converted to Arrays using the spread operator or Array.from.
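A sketch of this “dressing” step, with a few hard-coded rows standing in for the real parsed data (the variable names are my own):

```javascript
// Stand-ins for the parsed CSV rows: every value is still a string here.
const data = [
  { sepalLength: '5.1', sepalWidth: '3.5', petalLength: '1.4', petalWidth: '0.2', type: 'Iris-setosa' },
  { sepalLength: '7.0', sepalWidth: '3.2', petalLength: '4.7', petalWidth: '1.4', type: 'Iris-versicolor' },
  { sepalLength: '6.3', sepalWidth: '3.3', petalLength: '6.0', petalWidth: '2.5', type: 'Iris-virginica' },
];

// A Set keeps each type exactly once; spreading it gives us an Array,
// and each type's index in that Array becomes its numbered class.
const types = [...new Set(data.map((row) => row.type))];

// Turn the string measurements into floats...
const X = data.map((row) => [
  parseFloat(row.sepalLength),
  parseFloat(row.sepalWidth),
  parseFloat(row.petalLength),
  parseFloat(row.petalWidth),
]);
// ...and the type into its numbered class.
const y = data.map((row) => types.indexOf(row.type));
```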
Step 4. Train your model and then test it
Data has been dressed, wands at the ready — Expelliarmus:
The train method takes two mandatory arguments: the input data, such as the petal length and sepal width, and its actual class, such as Iris-setosa. It also takes an optional options parameter, which is just a JS object that can be passed in to tweak the internal parameters of the algorithm. We are passing the value of k as an option; the default value of k is 5.
Now that our model has been trained, let’s see how it performs on the test set. Mainly, we are interested in the number of misclassifications that occur. (That is, the number of times the model predicts the input to be one class when it’s actually another.)
The error is calculated as follows: we use the humble for-loop to loop over the test set and check whether the predicted output equals the actual output. Every mismatch is a misclassification.
Step 5. (Optional) Start Predicting
It’s time to have some prompts and predictions.
Feel free to skip this step, if you don’t want to test out the model on new input.
Step 6. Boom-shaw-shey-Done. 🚀
If you followed the steps, this is how your index.js should look:
Go fire a terminal 💻, and run
$ node index.js
Test Set Size = 45 and number of Misclassifications = 2
prompt: Sepal Length: 1.7
prompt: Sepal Width: 2.5
prompt: Petal Length: 0.5
prompt: Petal Width: 3.4
With 1.7,2.5,0.5,3.4 -- type = 2
Well done. That’s your kNN algorithm at work, classifying like a charm. 💹
All the code is on Github: machine-learning-with-js
A huge aspect of the kNN algorithm is the value of k, which is referred to as a hyperparameter. To paraphrase this answer on Quora, hyperparameters are a “kind of parameters that cannot be directly learned from the regular training process. These parameters express ‘higher-level’ properties of the model, such as its complexity or how fast it should learn.”
k defines how many blocks in the neighborhood of the address should be considered to classify it.
I am working on the ml-knn module, and hopefully the process of choosing k will be automated pretty soon.
If you are kinda excited and want to see what this can do, you can go to UC Irvine Machine Learning Repository and use your classifier on a different dataset. (That repository has hundreds.)
PS: To get the latest articles in this series, keep an eye on my profile, or you could cut yourself some slack and follow me. 😄
Thanks for reading! If you liked it, hit the green button ❤️ to let others know about how powerful JS is and why it shouldn’t be lagging behind when it comes to Machine Learning.