Associative Machine Learning: Predicting Cancer Risk

K Nearest Neighbor Predictive Classification Javascript implementation Utilizing D3 Visualizations

At its heart, the k Nearest Neighbor technique treats the problems it receives as issues of pattern recognition in order to properly classify data. It does this by ‘learning’ the graphical space in which the different categories reside; implicit is the assumption that each classes will be clustered distinctly. Computers learn similarly to living things, based on as much previous data as is available, predictions are made in order to do work more efficiently.

Typically, Javascript has not been thought of as a good language for the implementation of machine learning algorithms, and for good reason. When first developed, the language was incredibly inefficient and a misunderstanding of underlying concepts or blatant disregard for best practices led to a proliferation of terrible libraries and examples. Further, the community had been leaning towards OOP, but there has slowly been a movement to a more functional style. Over the years optimizations have been made, libraries have been refactored or newly created in ways that lend to performing complex processing, the underlying language implementation has had several iterations, and computing machines are also equipped with much greater processing capabilities.

All of this leads to an interesting space in which we, as Javascript developers, find ourselves. On the cusp of a rapidly transforming widespread technology, the possible applications of which are expanding, we have the opportunity to contribute to and improve a technology which is becoming increasingly fundamental to our daily experience. D3 has been one of the libraries most widely used with the goal of data visualization, and is depended upon several widespread visualization tools today; this is what will be used to plot out k Nearest Neighbors algorithm. If you need an introduction to the library before applying it to something more involved, an earlier post of mine introducing it might of use.

Now that we’re all excited about our role in this field, lets explore a simple implementation of arguably the most fundamental machine learning algorithm, k Nearest Neighbor. While on the less complex side of predictive computation algorithms, it is widely used in the data analysis industry, and is one of the best performing for several problems, one such example is the prediction of behavior between proteins, or protein-protein interaction (PPI):

“In spite of its simplicity to implement, when a large data set or numerous features are used the computational cost and memory requirement grows rapidly. This method has been used (not widely) in PPI prediction problem.” -Zahiri, Bozorgmehr, Masoudi-Nejad, 2013

To illustrate this, lets explore the example pictured to the left, the repo for which can be cloned down and explored, or you can play with it in this CodePen! Note that the dummy data is dumb; as should be inferred, they are not from a valid scientific study nor do the predictions represent a medical diagnosis in reality. It is loosely based on the findings of this medical study, and related published research outlining the correlation. which indicates that high levels of the proteins CA15–3 and CA27–29 could indicate the presence of a tumor in the breast; it should also be noted that these tracers also get spiked in cases of ovarian, liver, and other breast diseases. Units are calculated in units per milliliter of blood, with levels over 30U/mL considered high-risk. The training data can be changed in the source code, and the graph will handle the new data by fitting it to the graph as long as it is in the correct format: {x: num, y: num, neighbors: array, identity: num}.

To create this graph, there will be a series of functional tasks to accomplish, each of which my be solved differently, and modularity may be implemented in the functions to various degrees or with a different opinions with regards to the ideal separation of concerns. The following is my own opinion on how to best organize the program. The pieces we will address as individual problems in the construction of our larger program include:

Elements on the page:

A board
Training data points
Input data points

Classification functions:

Calibrate(smooth) data to fit to graph
Find k nearest neighbors
Classify each datum

Placement functions:

Place training data
Place input data
OPTIONAL: place circle containing input’s k nearest neighbors

To create the board, simply located the element on the page to which you want to append:

var h = 300;var w = 300;var svgSelection = d3.select(document.getElementById(‘body’)).append(“svg”).attr(“width”, w).attr(“height”, h);

Don’t forget to add a border:

var border = svgSelection.append(‘rect’).attr(‘x’, 0).attr(‘y’, 0).attr(‘height’, h).attr(‘width’, w).style(“stroke”, ‘black’).style(“fill”, “none”).style(“stroke-width”, 4);

At the top of the js page, include the following variables:

h and w are the dimensions of the board, max is an object which holds the highest values for the each of the training variables, into addData, each data point will be pushed it so that future inputs includes training and input data, and the riskColors correspond with the various possible states of classification.

To figure out where to place the points on the graph, we need to perform several operations in order to fit the data’s range within the size of the graph. The helper functions findMax and smooth will be utilized in our placement functions. Lets build those out:

var findMax = (data) => {var results = {x: 0, y: 0};data.map(datum => {let x = datum.x;let y = datum.y;x > results.x ? results.x = x : false;y > results.y ? results.y = y : false;});return results;}

var smooth = (datum, max) => {datum.x = (datum.x/max.x) * h;datum.y = (datum.y/max.y) * h;return datum;}

Next, we need a function that will draw the points of data on the graph:

var allData = [];//global container for training and input datavar riskColors = [‘mediumturquoise’, ‘crimson’, ‘rebeccapurple’];//indices of riskColors correspond with a classification

var placeDatum = (datum, encircle) => {svgSelection.append('circle').attr('cx', datum.x).attr('cy', datum.y).attr('r', 5).attr('fill', riskColors[datum.identity])encircle ? drawHalo(datum) : false;//draw halo implemented below!}var max;var populateGraph = (data, adjust=true, encircle=false) => {max === undefined ? max = findMax(data) : max = max;data.map(datum => {allData.push(datum);adjust ? smooth(datum, max) : null;placeDatum(datum, encircle);});}

Encircle will be used set to true for input data, invoking drawHalo in the placeDatum, which draws a circle around the data point containing its k nearest neighbors. At this point, all that will be on the screen is a blank rectangle. All this needs now is a set of training data and to be invoked. At the bottom of the file, place:

//this may be replaced with any data in which there are//identified objects with the three properties found below

var trainingData = [{x: 32,y: 40, identity: 1},{x: 5,y: 27, identity: 0},{x: 17,y: 9, identity: 0},{x: 50, y: 29, identity: 1},{x: 22,y: 46, identity: 1},{x: 10,y: 27, identity: 0},{x: 5,y: 9, identity: 0}];

populateGraph(trainingData);

The dummy data should now render to the page, clustered and colored. This is still pretty boring, to make it interesting some interface for user interactivity needs to be created. In the body of the HTML, we’ll be adding a form with the id submitData which contains input bars for each of the variables:

<div id=”submitData”><form onsubmit=”handleSubmit(event)”><b>Add data:</b> <br/>x:<input id=”x” type=”text” placeholder=”CA15–3 in U/mL”/><br/>y:<input id=”y” type=”text” placeholder=”CA27–29 in U/mL”><button>Submit</button></form></div>

The handler should prevent the page from re-rendering, get the inputs from the page, correctly format them to an object to then be passed into a drawing function as a data point:

var handleSubmit = function(e) {e.preventDefault();var x = document.getElementById(‘x’).value;var y = document.getElementById(‘y’).value;var datum = {x: x,y: y,neighbors: [],identity: 0};classifyAndPlot([datum]);//classify and plot not yet defineddocument.getElementById(‘x’).value = ‘’;document.getElementById(‘y’).value = ‘’;//clear the input bars}

The formatted data then is passed into a function to categorize them based on the points closest to them, which is calculated by looping through all of the data points, calculating distance using good old Pythagoras’s formula. Lets all say it together: “a squared plus b squared equals c squared.” Good! Except this time, it’ll be √(xDiff²+yDiff²):

var findNeighbors = function(data, k) {return data.map(datum => {allData.map(neighbor => {var xDiff = Math.abs(datum.x — neighbor.x);var yDiff = Math.abs(datum.y — neighbor.y);var distance = Math.sqrt(Math.pow(xDiff, 2) + Math.pow(yDiff, 2));var currentFartherNeighbor = datum.neighbors.find(dataNeighbor => dataNeighbor[0] > distance)datum.neighbors.length < k ?datum.neighbors.push([distance, neighbor.identity]) :currentFartherNeighbor ?datum.neighbors.splice(datum.neighbors.indexOf(currentFartherNeighbor), 1, [distance, neighbor.identity]) :false;datum.neighbors.sort((a, b) => b[0] — a[0]);return datum;});});}

The above functions loops over the input data, and compares each datum to each of the allData array, pushing into the neighbors array if there are less then k neighbors or if a comparison yields a distance smaller than any currently in the array. In the case of the latter, the farthest neighbor also needs to be spliced out. Then the datum is returned with the neighbors array populated as expected. However, one more component is necessary in order to paint them on the page. We have functions to do the later which may be reused, but before passing data into them to render onto the page, we should consider what the input looks like when this function is invoked, and how to format the input to be successfully accepted by the populateGraph function.

var classifyAndPlot = (data) => {//fit the data to the graph rangedata.map(datum => smooth(datum, max));//populate neighborsfindNeighbors(data, 3);var formattedData = data.map(datum => {var neighborsIdentity = datum.neighbors.reduce((acc, neighbor) => {acc[neighbor[1]] === undefined ? acc[neighbor[1]] = 1 :acc[neighbor[1]] = acc[neighbor[1]] + 1;return acc;}, []);var maximum = Math.max(…neighborsIdentity.filter(neighbor => neighbor));datum.identity = neighborsIdentity.indexOf(maximum);return datum;});populateGraph(formattedData, false, true);}

That’s it! We have everything we need to train our algorithm and predict unclassified input results based on present clustering trends. While it has been noted that this is not the most efficient solution to most machine learning problems, but is useful in cases where there are discrete categories with clear distinctions. As someone interested in medicine and public health, I’ve learned that cancer diagnosis is one of those cases, in which you are assigned either a positive or negative diagnosis based on various biometric measures related to understood oncological trends. This is the idea that the example data is modeled after, but it can be modified for any training data in which each datum has been classified. What are some ideas for application that come to mind for you? Let me know below in the comments.

Thanks for reading, now go make your own improved implementations!

Jacob