K Nearest Neighbor Predictive Classification: Javascript Implementation Utilizing D3 Visualizations

At its heart, the k Nearest Neighbor technique treats the problems it receives as issues of pattern recognition in order to properly classify data. It does this by ‘learning’ the graphical space in which the different categories reside; implicit is the assumption that each class will be clustered distinctly. Computers learn similarly to living things: based on as much previous data as is available, predictions are made in order to do work more efficiently.

Typically, Javascript has not been thought of as a good language for the implementation of machine learning algorithms, and for good reason. When first developed, the language was incredibly inefficient, and a misunderstanding of underlying concepts or blatant disregard for best practices led to a proliferation of terrible libraries and examples. Further, the community had been leaning towards OOP, but there has slowly been a movement to a more functional style. Over the years optimizations have been made, libraries have been refactored or newly created in ways that lend themselves to complex processing, the underlying language implementation has gone through several iterations, and computing machines are equipped with much greater processing capabilities.

All of this leads to an interesting space in which we, as Javascript developers, find ourselves. On the cusp of a rapidly transforming, widespread technology whose possible applications are expanding, we have the opportunity to contribute to and improve a technology which is becoming increasingly fundamental to our daily experience. D3 has been one of the libraries most widely used for data visualization, and is depended upon by several widespread visualization tools today; it is what will be used to plot out the k Nearest Neighbors algorithm.
If you need an introduction to the library before applying it to something more involved, an earlier post of mine introducing it might be of use.

Now that we’re all excited about our role in this field, let’s explore a simple implementation of arguably the most fundamental machine learning algorithm, k Nearest Neighbor. While on the less complex side of predictive computation algorithms, it is widely used in the data analysis industry, and is one of the best performing for several problems. One such example is the prediction of behavior between proteins, or protein-protein interaction (PPI):

“In spite of its simplicity to implement, when a large data set or numerous features are used the computational cost and memory requirement grows rapidly. This method has been used (not widely) in PPI prediction problem.” - Zahiri, Bozorgmehr, Masoudi-Nejad, 2013

To illustrate this, let’s explore the example pictured to the left, for which the repo can be cloned down and explored, or you can play with it in this CodePen! Note that the dummy data is dumb; as should be inferred, it is not from a valid scientific study, nor do the predictions represent a medical diagnosis in reality. It is loosely based on the findings of this medical study, and related published research outlining the correlation, which indicates that high levels of the proteins CA15–3 and CA27–29 could indicate the presence of a tumor in the breast; it should also be noted that these tracers also get spiked in cases of ovarian, liver, and other breast diseases. Levels are measured in units per milliliter of blood (U/mL), with levels over 30U/mL considered high-risk. The training data can be changed in the source code, and the graph will handle the new data by fitting it to the graph as long as it is in the correct format: {x: num, y: num, neighbors: array, identity: num}.

To create this graph, there will be a series of functional tasks to accomplish, each of which may be solved differently, and modularity may be implemented in the functions to various degrees or with different opinions with regard to the ideal separation of concerns. The following is my own opinion on how to best organize the program. The pieces we will address as individual problems in the construction of our larger program include:

Elements on the page:
- A board
- Training data points
- Input data points

Classification functions:
- Calibrate (smooth) data to fit to graph
- Find k nearest neighbors
- Classify each datum

Placement functions:
- Place training data
- Place input data
- OPTIONAL: place circle containing input’s k nearest neighbors

To create the board, simply locate the element on the page to which you want to append:

    var h = 300;
    var w = 300;
    var svgSelection = d3.select(document.getElementById('body'))
      .append("svg")
      .attr("width", w)
      .attr("height", h);

Don’t forget to add a border:

    var border = svgSelection.append('rect')
      .attr('x', 0)
      .attr('y', 0)
      .attr('height', h)
      .attr('width', w)
      .style("stroke", 'black')
      .style("fill", "none")
      .style("stroke-width", 4);

At the top of the js page, include the following variables: h and w are the dimensions of the board; max is an object which holds the highest values for each of the training variables; allData is the array into which each data point will be pushed, so that future inputs are compared against both training and earlier input data; and the riskColors correspond with the various possible states of classification.

To figure out where to place the points on the graph, we need to perform several operations in order to fit the data’s range within the size of the graph. The helper functions findMax and smooth will be utilized in our placement functions. Let’s build those out:

    //find the largest x and y values in the data
    var findMax = (data) => {
      var results = {x: 0, y: 0};
      data.forEach(datum => {
        if (datum.x > results.x) results.x = datum.x;
        if (datum.y > results.y) results.y = datum.y;
      });
      return results;
    }

    //scale a datum in place so that the largest value lands on the board's edge
    var smooth = (datum, max) => {
      datum.x = (datum.x / max.x) * w;
      datum.y = (datum.y / max.y) * h;
      return datum;
    }

Next, we need a function that will draw the points of data on the graph:

    var allData = []; //global container for training and input data
    var riskColors = ['mediumturquoise', 'crimson', 'rebeccapurple']; //indices of riskColors correspond with a classification

    var placeDatum = (datum, encircle) => {
      svgSelection.append('circle')
        .attr('cx', datum.x)
        .attr('cy', datum.y)
        .attr('r', 5)
        .attr('fill', riskColors[datum.identity]);
      if (encircle) drawHalo(datum); //drawHalo (not listed in this post) circles the datum's nearest neighbors
    }

    var max;

    var populateGraph = (data, adjust=true, encircle=false) => {
      if (max === undefined) max = findMax(data);
      data.forEach(datum => {
        allData.push(datum);
        if (adjust) smooth(datum, max);
        placeDatum(datum, encircle);
      });
    }

encircle will be set to true for input data, invoking drawHalo in placeDatum, which draws a circle around the data point containing its k nearest neighbors. At this point, all that will be on the screen is a blank rectangle. All this needs now is a set of training data and a call to populateGraph. At the bottom of the file, place:

    //this may be replaced with any data in which there are
    //identified objects with the three properties found below
    var trainingData = [
      {x: 32, y: 40, identity: 1},
      {x: 5, y: 27, identity: 0},
      {x: 17, y: 9, identity: 0},
      {x: 50, y: 29, identity: 1},
      {x: 22, y: 46, identity: 1},
      {x: 10, y: 27, identity: 0},
      {x: 5, y: 9, identity: 0}
    ];

    populateGraph(trainingData);

The dummy data should now render to the page, clustered and colored. This is still pretty boring; to make it interesting, some interface for user interactivity needs to be created.
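As an aside on the scaling step, it may help to see the arithmetic on its own, away from D3. The following is only a sketch under assumptions: the names boardWidth, boardHeight, maxOf, and scaled are hypothetical and not part of the original program, and unlike smooth, this variant returns a scaled copy rather than changing the datum in place.

```javascript
// Board dimensions, matching the 300 x 300 board used in the post.
var boardWidth = 300;
var boardHeight = 300;

// Hypothetical helper: largest x and y in the data (same idea as findMax).
var maxOf = (data) => ({
  x: Math.max(0, ...data.map(d => d.x)),
  y: Math.max(0, ...data.map(d => d.y))
});

// Hypothetical non-mutating variant of smooth: returns a scaled copy.
var scaled = (datum, max) => Object.assign({}, datum, {
  x: (datum.x / max.x) * boardWidth,
  y: (datum.y / max.y) * boardHeight
});

// A three-point subset of the dummy training data.
var sample = [
  {x: 32, y: 40, identity: 1},
  {x: 5, y: 27, identity: 0},
  {x: 50, y: 29, identity: 1}
];

var sampleMax = maxOf(sample);            // largest x is 50, largest y is 40
var first = scaled(sample[0], sampleMax); // x is roughly 192; y is 300 (the board's edge)
```

Because the largest value in each dimension maps to the full width or height, the extreme points sit exactly on the border; dividing by a slightly larger maximum would be one way to leave a margin.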
In the body of the HTML, we’ll be adding a form, wrapped in a div with the id submitData, which contains input bars for each of the variables:

    <div id="submitData">
      <form onsubmit="handleSubmit(event)">
        <b>Add data:</b> <br/>
        x:<input id="x" type="text" placeholder="CA15–3 in U/mL"/><br/>
        y:<input id="y" type="text" placeholder="CA27–29 in U/mL"/>
        <button>Submit</button>
      </form>
    </div>

The handler should prevent the page from re-rendering, get the inputs from the page, and correctly format them into an object to then be passed into a drawing function as a data point:

    var handleSubmit = function(e) {
      e.preventDefault();
      var x = document.getElementById('x').value;
      var y = document.getElementById('y').value;
      var datum = {x: x, y: y, neighbors: [], identity: 0};
      classifyAndPlot([datum]); //classifyAndPlot is not yet defined
      //clear the input bars
      document.getElementById('x').value = '';
      document.getElementById('y').value = '';
    }

The formatted data is then passed into a function that categorizes each datum based on the points closest to it, which is calculated by looping through all of the data points and calculating distance using good old Pythagoras’s formula. Let’s all say it together: “a squared plus b squared equals c squared.” Good!
Except this time, it’ll be √(xDiff²+yDiff²):

    var findNeighbors = function(data, k) {
      return data.map(datum => {
        allData.forEach(neighbor => {
          var xDiff = Math.abs(datum.x - neighbor.x);
          var yDiff = Math.abs(datum.y - neighbor.y);
          var distance = Math.sqrt(Math.pow(xDiff, 2) + Math.pow(yDiff, 2));
          var currentFartherNeighbor = datum.neighbors.find(dataNeighbor => dataNeighbor[0] > distance);
          if (datum.neighbors.length < k) {
            datum.neighbors.push([distance, neighbor.identity]);
          } else if (currentFartherNeighbor) {
            datum.neighbors.splice(datum.neighbors.indexOf(currentFartherNeighbor), 1, [distance, neighbor.identity]);
          }
          //keep the farthest neighbor at the front of the array
          datum.neighbors.sort((a, b) => b[0] - a[0]);
        });
        return datum;
      });
    }

The above function loops over the input data and compares each datum to each element of the allData array, pushing into the neighbors array if there are fewer than k neighbors or if a comparison yields a distance smaller than any currently in the array. In the case of the latter, the farthest neighbor also needs to be spliced out. Then the datum is returned with the neighbors array populated as expected. However, one more component is necessary in order to paint the results on the page. We already have a function to do the latter, populateGraph, which may be reused, but before passing data into it to render onto the page, we should consider what the input looks like when this function is invoked, and how to format the input to be successfully accepted by the function.

    var classifyAndPlot = (data) => {
      //fit the data to the graph range
      data.forEach(datum => smooth(datum, max));
      //populate neighbors
      findNeighbors(data, 3);
      var formattedData = data.map(datum => {
        //count how many neighbors carry each identity
        var neighborsIdentity = datum.neighbors.reduce((acc, neighbor) => {
          acc[neighbor[1]] = (acc[neighbor[1]] || 0) + 1;
          return acc;
        }, []);
        //the identity with the highest count wins
        var maximum = Math.max(...neighborsIdentity.filter(neighbor => neighbor));
        datum.identity = neighborsIdentity.indexOf(maximum);
        return datum;
      });
      populateGraph(formattedData, false, true);
    }

That’s it!
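One helper that placeDatum calls but that is not listed in this post is drawHalo. What follows is only a guess at its shape, not the original implementation: it assumes each entry in datum.neighbors is a [distance, identity] pair as built by findNeighbors, introduces a hypothetical haloRadius helper, and accepts the SVG selection as an optional second argument (defaulting to the global svgSelection) so that it can be exercised without a page. The gray stroke is an arbitrary choice.

```javascript
// Hypothetical helper: a radius that just reaches the farthest stored neighbor.
// Assumes datum.neighbors holds [distance, identity] pairs.
var haloRadius = (datum) =>
  datum.neighbors.reduce((farthest, neighbor) => Math.max(farthest, neighbor[0]), 0);

// Sketch of drawHalo: an unfilled circle centered on the datum.
// `selection` falls back to the global svgSelection only when it is omitted.
var drawHalo = (datum, selection = svgSelection) =>
  selection.append('circle')
    .attr('cx', datum.x)
    .attr('cy', datum.y)
    .attr('r', haloRadius(datum))
    .style('fill', 'none')
    .style('stroke', 'gray')
    .style('stroke-width', 1);
```

Since the distances are computed after the data has been fit to the board, the radius is already in board pixels and needs no further scaling.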
We have everything we need to train our algorithm and predict unclassified input results based on present clustering trends. While it has been noted that this is not the most efficient solution to most machine learning problems, it is useful in cases where there are discrete categories with clear distinctions. As someone interested in medicine and public health, I’ve learned that cancer diagnosis is one of those cases, in which you are assigned either a positive or negative diagnosis based on various biometric measures related to understood oncological trends. This is the idea that the example data is modeled after, but it can be modified for any training data in which each datum has been classified.

What are some ideas for application that come to mind for you? Let me know below in the comments.

Thanks for reading, now go make your own improved implementations!

Jacob

Associative Machine Learning: Predicting Cancer Risk
