Diving deep into Supervised Learning
This is Part 2 of the ongoing series Machine Learning with JavaScript. Here's Part 1.
It's kNN time.
kNN stands for k-Nearest Neighbors, a supervised learning algorithm that can be used for classification as well as regression problems. First, we are gonna say hello to kNN, but if you want, you can skip ahead to the code. GitHub Repository: Machine Learning with JS.
kNN assigns a new data point the class that is most common among its k nearest neighbors.
If a new data point's neighbors are NY: 7, NJ: 0, and IN: 4, then its class will be NY.
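To make that vote concrete, here is a tiny standalone sketch (the neighbor classes are made up for illustration):

```js
// Hypothetical classes of the 11 nearest neighbours of a new point.
const neighbourClasses = ['NY', 'NY', 'IN', 'NY', 'IN', 'NY', 'NY', 'IN', 'NY', 'IN', 'NY'];

// Count each class and pick the most frequent one.
const counts = {};
neighbourClasses.forEach((c) => { counts[c] = (counts[c] || 0) + 1; });
const predicted = Object.keys(counts).reduce((a, b) => (counts[a] >= counts[b] ? a : b));

console.log(predicted); // NY (7 votes vs 4 for IN)
```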
Let's say you work at a Post Office and your job is to organize and distribute letters among Postmen so as to minimize the number of trips to the different neighborhoods. And since we are just imagining stuff, we can assume that there are only seven different neighborhoods. This is a kind of classification problem. You need to divide the letters into classes, where classes here refer to Upper East Side, Downtown Manhattan, and so on.
If you love wasting time and resources, you might give one letter from every neighborhood to each Postman, and hope that they meet each other in the same neighborhood and discover your corrupt plan. That's the worst kind of distribution you could achieve.
On the other hand, you could organize the letters based on which addresses are close to each other.
You might start with "If it's within a three-block range, give it to the same Postman." That number of nearest blocks is where **k** comes from. You can keep increasing the number of blocks until you hit an efficient distribution. That's the most efficient value of k for your classification problem.
So, based on some parameter(s), like the address of the house here, you classified which neighborhood a letter belongs to: Downtown Manhattan, Times Square, et cetera. (I am not good with names, so bear with me.)
As we did in the last tutorial, we are going to use ml.js's KNN module to train our kNearestNeighbors classifier. Every Machine Learning problem needs data, and we are gonna use the Iris dataset in this tutorial.
The Iris dataset consists of 3 different types of irises' (Setosa, Versicolour, and Virginica) petal and sepal lengths, along with a field signifying their respective type.
Step 1. Install the libraries
$ yarn add ml-knn csvtojson prompt
Or if you like npm
$ npm install ml-knn csvtojson prompt
[ml-knn](https://github.com/mljs/knn): k-Nearest Neighbors
[csvtojson](https://github.com/Keyang/node-csvtojson): Parse the data
[prompt](https://github.com/flatiron/prompt): To allow user prompts for predictions
Step 2. Initialize the library and load the data
The Iris dataset is provided by the University of California, Irvine and is available here. However, because of the way it's organized, you are gonna have to copy the content in the browser (Select All | Copy) and paste it into a file named iris.csv. You can name it whatever you want, except that the extension must be .csv.
Now, initialize the library and load the data. I am assuming you already have an empty npm project set up, but if you are not familiar with it, here's a quick intro.
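Here is a minimal sketch of that setup; the variable names follow the description below, but the exact layout of the original code is an assumption:

```js
const KNN = require('ml-knn');
const csv = require('csvtojson');
const prompt = require('prompt');

let knn; // will hold the trained classifier

const csvFilePath = 'iris.csv';
// Our own header names, since the CSV has no header row.
const names = ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'type'];

let seperationSize; // index at which we split into training and test sets

let data = [];
let X = []; // features
let y = []; // numbered classes

let trainingSetX = [];
let trainingSetY = [];
let testSetX = [];
let testSetY = [];
```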
The header names are used for visualization and understanding. They will be removed later. Also, `seperationSize` is used to split the data into training and test datasets.
Cool, eh?
We imported the `csvtojson` package, and now we are going to use its `fromFile` method to load the data. (Since our data doesn't have a header row, we are providing our own header names.) We are pushing each row to the `data` variable, and when the process is done, we are setting `seperationSize` to 0.7 times the number of samples in our dataset. Note that if the size of the training set is too low, the classifier may not perform as well as it would with a larger set.
Since our dataset is sorted with respect to types (`console.log` it to confirm), the `shuffleArray` function is used to, well, shuffle the dataset to allow splitting. (If you don't shuffle, you might end up with a model which works fine for the first two classes but fails with the third.) Here's how it is defined. I got it from an answer over at StackOverflow.
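That answer is almost certainly a version of the Fisher-Yates (Durstenfeld) shuffle:

```js
// Fisher-Yates (Durstenfeld) shuffle: swap each element with a
// randomly chosen element at or before it, in place.
function shuffleArray(array) {
  for (let i = array.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    const temp = array[i];
    array[i] = array[j];
    array[j] = temp;
  }
}
```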
Step 3. Dress Data (yet again)
Our data is organized as follows:
{sepalLength: '5.1', sepalWidth: '3.5', petalLength: '1.4', petalWidth: '0.2', type: 'Iris-setosa'}
There are two things we need to do to our data before we serve it to the kNN classifier:

1. Convert the numeric values from strings to floats. (Using `parseFloat`.)
2. Convert the `type` into numbered classes. (Computers like numbers, you know?)

If you are not familiar with Sets, they are just like their mathematical counterparts, as in they can't have duplicate elements, and their elements do not have an index. (As opposed to Arrays.)
And a Set can easily be converted to an Array using the spread operator, while an Array can be turned into a Set by passing it to the Set constructor.
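Putting the two steps together, a `dressData` function might look like this (the function name and the split using `seperationSize` follow the article's flow; the details are assumptions):

```js
function dressData() {
  // Unique type names: a Set drops duplicates, and spread turns it back
  // into an Array, e.g. ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'].
  const typesArray = [...new Set(data.map((row) => row.type))];

  data.forEach((row) => {
    X.push([
      parseFloat(row.sepalLength),
      parseFloat(row.sepalWidth),
      parseFloat(row.petalLength),
      parseFloat(row.petalWidth),
    ]);
    y.push(typesArray.indexOf(row.type)); // numbered class: 0, 1, or 2
  });

  // Split into training and test sets.
  trainingSetX = X.slice(0, seperationSize);
  trainingSetY = y.slice(0, seperationSize);
  testSetX = X.slice(seperationSize);
  testSetY = y.slice(seperationSize);
}
```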
Step 4. Train your model and test it
Data has been dressed, wands at the ready. Expelliarmus:
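In ml-knn, training happens when you construct the classifier; here is a sketch, with k = 7 as an arbitrary starting value:

```js
// Train the classifier on the training set. The third argument is the
// optional options object; k = 7 is an arbitrary choice for illustration.
knn = new KNN(trainingSetX, trainingSetY, { k: 7 });
```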
Training takes two mandatory arguments: the input data, such as the petal length and sepal width, and its actual class, such as Iris-setosa. It also takes an optional options parameter, which is just a JS object that can be passed to tweak the internal parameters of the algorithm. We are passing the value of **k** as an option. The default value of `k` is 5.
Now that our model has been trained, let's see how it performs on the test set. Mainly, we are interested in the number of misclassifications that occur. (That is, the number of times it predicts the input to be one class even though it's actually another.)
The error is calculated as follows: we use the humble for-loop to loop over the test set and check whether the predicted output differs from the actual output. Each mismatch is a misclassification.
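A sketch of that test loop (the function and variable names are assumptions, chosen to match the sample output below):

```js
function test() {
  const result = knn.predict(testSetX);
  const testSetLength = testSetX.length;

  // Count how often the predicted class differs from the actual class.
  let misclassifications = 0;
  for (let i = 0; i < testSetLength; i++) {
    if (result[i] !== testSetY[i]) {
      misclassifications++;
    }
  }

  console.log(`Test Set Size = ${testSetLength} and number of Misclassifications = ${misclassifications}`);
}
```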
Step 5. (Optional) Start Predicting
It's time to have some prompts and predictions.
Feel free to skip this step if you don't want to test out the model on new input.
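Here is a sketch of the prompt-driven prediction, using the prompt package's `start`/`get` API (the prompt labels match the sample run below):

```js
function predict() {
  prompt.start();

  // Ask for the four measurements, then classify the new sample.
  prompt.get(['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'], (err, result) => {
    if (err) {
      return console.error(err);
    }
    const features = [
      parseFloat(result['Sepal Length']),
      parseFloat(result['Sepal Width']),
      parseFloat(result['Petal Length']),
      parseFloat(result['Petal Width']),
    ];
    console.log(`With ${features.join(',')} -- type = ${knn.predict([features])[0]}`);
  });
}
```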
If you followed the steps, this is how your index.js should look. (The complete file is in the GitHub repository linked below.)
Go fire up a terminal, and run node index.js.
$ node index.js
Test Set Size = 45 and number of Misclassifications = 2
prompt: Sepal Length: 1.7
prompt: Sepal Width: 2.5
prompt: Petal Length: 0.5
prompt: Petal Width: 3.4
With 1.7,2.5,0.5,3.4 -- type = 2
Well done. That's your kNN algorithm at work, classifying like a charm.
All the code is on GitHub: machine-learning-with-js
A huge aspect of the kNN algorithm is the value of k, which is referred to as a hyperparameter. To paraphrase this answer on Quora: hyperparameters are a "kind of parameter that cannot be directly learned from the regular training process. These parameters express 'higher-level' properties of the model, such as its complexity or how fast it should learn."
k defines how many blocks in the neighborhood of the address should be considered to classify it.
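A simple way to choose k by hand is to retrain with a few candidate values and keep the one with the fewest misclassifications on the test set; for example:

```js
// Try odd values of k and report the error on the held-out test set.
for (let k = 1; k <= 15; k += 2) {
  const candidate = new KNN(trainingSetX, trainingSetY, { k });
  const predictions = candidate.predict(testSetX);
  const errors = predictions.filter((p, i) => p !== testSetY[i]).length;
  console.log(`k = ${k}: ${errors} misclassifications`);
}
```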
I am working on the `ml-knn` module, and hopefully the process of choosing k will be automated pretty soon.
If you are kinda excited and want to see what this can do, you can go to UC Irvine Machine Learning Repository and use your classifier on a different dataset. (That repository has hundreds.)
PS: To get the latest articles in this series, keep an eye on my profile, or you could cut yourself some slack and follow me.
Thanks for reading! If you liked it, hit the green heart to let others know how powerful JS is and why it shouldn't be lagging behind when it comes to Machine Learning.