Associative Machine Learning: Predicting Cancer Risk — by @jmpenney22

July 29th 2017

At its heart, the k Nearest Neighbor technique treats the problems it receives as pattern-recognition tasks in order to classify data correctly. It does this by "learning" the graphical space in which the different categories reside; implicit is the assumption that each class will cluster distinctly. Computers learn much like living things do: based on as much previous data as is available, predictions are made in order to do work more efficiently.
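Before getting into D3 at all, the whole idea fits in a few lines of plain JavaScript. This is a bare-bones sketch with made-up data, not the implementation built later in this post: measure the distance from an unclassified point to every labeled point, take the k closest, and vote on the majority label.

```javascript
// Minimal k-nearest-neighbor classifier over {x, y, label} points.
var classify = (point, labeled, k) => {
  var nearest = labeled
    .map(p => ({ label: p.label, dist: Math.hypot(p.x - point.x, p.y - point.y) }))
    .sort((a, b) => a.dist - b.dist) // closest first
    .slice(0, k);
  // Tally the labels of the k nearest points and return the majority.
  var votes = {};
  nearest.forEach(n => votes[n.label] = (votes[n.label] || 0) + 1);
  return Object.keys(votes).reduce((a, b) => votes[a] >= votes[b] ? a : b);
};

var training = [
  {x: 1, y: 1, label: 'low'}, {x: 2, y: 1, label: 'low'},
  {x: 9, y: 9, label: 'high'}, {x: 8, y: 9, label: 'high'}, {x: 9, y: 8, label: 'high'}
];
classify({x: 1.5, y: 1}, training, 3); // → 'low'
```

A new point near the bottom-left cluster gets two 'low' votes out of three, so it is classified 'low'; the D3 version below does the same thing with more bookkeeping.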

Typically, Javascript has not been thought of as a good language for implementing machine learning algorithms, and for good reason. When first developed, the language was incredibly inefficient, and a misunderstanding of underlying concepts or blatant disregard for best practices led to a proliferation of terrible libraries and examples. Further, the community had been leaning towards OOP, but there has slowly been a movement to a more functional style. Over the years optimizations have been made, libraries have been refactored or newly created in ways that lend themselves to performing complex processing, the underlying language implementation has had several iterations, and computing machines are now equipped with much greater processing capabilities.

All of this leads to an interesting space in which we, as Javascript developers, find ourselves. On the cusp of a rapidly transforming widespread technology, the possible applications of which are expanding, we have the opportunity to contribute to and improve a technology which is becoming increasingly fundamental to our daily experience. D3 has been one of the libraries most widely used for data visualization, and is depended upon by several widespread visualization tools today; this is what will be used to plot our k Nearest Neighbors algorithm. If you need an introduction to the library before applying it to something more involved, an earlier post of mine introducing it might be of use.

Now that we're all excited about our role in this field, let's explore a simple implementation of arguably the most fundamental machine learning algorithm, k Nearest Neighbor. While it sits on the less complex side of predictive computation algorithms, it is widely used in the data analysis industry and is one of the best performing for several problems; one such example is the prediction of behavior between proteins, or protein-protein interaction (PPI):

"In spite of its simplicity to implement, when a large data set or numerous features are used the computational cost and memory requirement grows rapidly. This method has been used (not widely) in PPI prediction problem." — Zahiri, Bozorgmehr, Masoudi-Nejad, 2013

To illustrate this, let's explore the example pictured here; the repo can be cloned down and explored, or you can play with it in this CodePen! Note that the dummy data is dumb: it does not come from a valid scientific study, nor do the predictions represent a real medical diagnosis. It is loosely based on the findings of this medical study, and on related published research outlining the correlation, which indicate that high levels of the proteins CA15-3 and CA27-29 can signal the presence of a tumor in the breast; it should also be noted that these tracers also spike in cases of ovarian, liver, and other breast diseases. Levels are measured in units per milliliter of blood, with levels over 30 U/mL considered high-risk. The training data can be changed in the source code, and the graph will handle the new data by fitting it to the graph as long as it is in the correct format: {x: num, y: num, neighbors: array, identity: num}.
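Concretely, a single datum in that format looks like this (the values match the first entry of the training set used later in the post; the comments on the risk labels are my reading of the example, not part of the study):

```javascript
// One datum in the expected format: raw marker levels in U/mL plus
// the bookkeeping fields the algorithm fills in or reads.
var sample = {
  x: 32,         // CA15-3 level
  y: 40,         // CA27-29 level
  neighbors: [], // populated later with [distance, identity] pairs
  identity: 1    // 1 = high-risk cluster, 0 = low-risk
};
```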

To create this graph, there is a series of functional tasks to accomplish, each of which may be solved differently; modularity may be implemented in the functions to various degrees, with differing opinions about the ideal separation of concerns. The following is my own opinion on how best to organize the program. The pieces we will address as individual problems in the construction of our larger program include:

- A board
- Training data points
- Input data points

- Calibrate(smooth) data to fit to graph
- Find k nearest neighbors
- Classify each datum

- Place training data
- Place input data
- OPTIONAL: place circle containing inputβs k nearest neighbors

To create the board, simply locate the element on the page to which you want to append:

```javascript
var h = 300;
var w = 300;

var svgSelection = d3.select(document.getElementById('body'))
  .append('svg')
  .attr('width', w)
  .attr('height', h);
```

Don't forget to add a border:

```javascript
var border = svgSelection.append('rect')
  .attr('x', 0)
  .attr('y', 0)
  .attr('height', h)
  .attr('width', w)
  .style('stroke', 'black')
  .style('fill', 'none')
  .style('stroke-width', 4);
```

At the top of the js file, include the following variables: `h` and `w` are the dimensions of the board; `max` is an object which holds the highest value for each of the training variables; `allData` is the array into which each data point will be pushed, so that future inputs are compared against both the training and input data; and the indices of `riskColors` correspond with the possible states of classification.

To figure out where to place the points on the graph, we need to perform several operations in order to fit the data's range within the size of the graph. The helper functions `findMax` and `smooth` will be utilized in our placement functions. Let's build those out:

```javascript
var findMax = (data) => {
  var results = {x: 0, y: 0};
  data.map(datum => {
    let x = datum.x;
    let y = datum.y;
    x > results.x ? results.x = x : false;
    y > results.y ? results.y = y : false;
  });
  return results;
}
```

```javascript
var smooth = (datum, max) => {
  datum.x = (datum.x / max.x) * w; // scale x by the board's width
  datum.y = (datum.y / max.y) * h; // scale y by the board's height
  return datum;
}
```
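As a quick check of the scaling, here is a standalone run (restating `smooth` so it can execute on its own; the maxima happen to match those of the training set used later, {x: 50, y: 46}):

```javascript
// Fit a raw reading into a 300x300 pixel space by dividing each axis
// by its maximum observed value and multiplying by the board dimension.
var h = 300;
var smooth = (datum, max) => {
  datum.x = (datum.x / max.x) * h;
  datum.y = (datum.y / max.y) * h;
  return datum;
};

var scaled = smooth({x: 25, y: 23}, {x: 50, y: 46});
// scaled is {x: 150, y: 150}: halfway along both axes of the board
```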

Next, we need a function that will draw the points of data on the graph:

```javascript
var allData = []; // global container for training and input data
var riskColors = ['mediumturquoise', 'crimson', 'rebeccapurple'];
// indices of riskColors correspond with a classification

var placeDatum = (datum, encircle) => {
  svgSelection.append('circle')
    .attr('cx', datum.x)
    .attr('cy', datum.y)
    .attr('r', 5)
    .attr('fill', riskColors[datum.identity]);
  encircle ? drawHalo(datum) : false;
  // drawHalo implemented below!
}
```
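`drawHalo` is referenced here but its body is not listed in this post; a minimal hypothetical sketch follows. It assumes the global `svgSelection` from earlier, and that the datum's neighbors array holds `[distance, identity]` pairs sorted farthest-first (as `findNeighbors` below leaves it), so the first entry gives the halo's radius:

```javascript
// Hypothetical sketch: draw an unfilled circle around an input point,
// sized to enclose its k nearest neighbors.
var drawHalo = (datum) => {
  // neighbors are sorted farthest-first, so index 0 holds the largest distance
  var radius = datum.neighbors.length ? datum.neighbors[0][0] : 0;
  svgSelection.append('circle')
    .attr('cx', datum.x)
    .attr('cy', datum.y)
    .attr('r', radius + 5) // pad so the neighbor dots sit inside the halo
    .style('stroke', 'black')
    .style('fill', 'none');
}
```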

```javascript
var max;
var populateGraph = (data, adjust = true, encircle = false) => {
  max === undefined ? max = findMax(data) : max = max;
  data.map(datum => {
    allData.push(datum);
    adjust ? smooth(datum, max) : null;
    placeDatum(datum, encircle);
  });
}
```

`encircle` will be set to true for input data, invoking `drawHalo` inside `placeDatum`, which draws a circle around the data point containing its `k` nearest neighbors. At this point, all that will be on the screen is a blank rectangle. All this needs now is a set of training data and to be invoked. At the bottom of the file, place:

```javascript
// this may be replaced with any data in which there are
// identified objects with the three properties found below
var trainingData = [{x: 32, y: 40, identity: 1},
                    {x: 5,  y: 27, identity: 0},
                    {x: 17, y: 9,  identity: 0},
                    {x: 50, y: 29, identity: 1},
                    {x: 22, y: 46, identity: 1},
                    {x: 10, y: 27, identity: 0},
                    {x: 5,  y: 9,  identity: 0}];

populateGraph(trainingData);
```

The dummy data should now render to the page, clustered and colored. This is still pretty boring; to make it interesting, some interface for user interactivity needs to be created. In the body of the HTML, we'll add a form with the id `submitData` which contains input bars for each of the variables:

```html
<div id="submitData">
  <form onsubmit="handleSubmit(event)">
    <b>Add data:</b> <br/>
    x: <input id="x" type="text" placeholder="CA15-3 in U/mL"/><br/>
    y: <input id="y" type="text" placeholder="CA27-29 in U/mL">
    <button>Submit</button>
  </form>
</div>
```

The handler should prevent the page from re-rendering, get the inputs from the page, correctly format them to an object to then be passed into a drawing function as a data point:

```javascript
var handleSubmit = function(e) {
  e.preventDefault();
  // coerce the text inputs to numbers
  var x = Number(document.getElementById('x').value);
  var y = Number(document.getElementById('y').value);
  var datum = {
    x: x,
    y: y,
    neighbors: [],
    identity: 0
  };
  classifyAndPlot([datum]);
  // classifyAndPlot not yet defined
  document.getElementById('x').value = '';
  document.getElementById('y').value = '';
  // clear the input bars
}
```

The formatted data is then passed into a function that categorizes each point based on the points closest to it, which is calculated by looping through all of the data points and computing distance with good old Pythagoras's formula. Let's all say it together: "a squared plus b squared equals c squared." Good! Except this time, it'll be √(xDiff² + yDiff²):

```javascript
var findNeighbors = function(data, k) {
  return data.map(datum => {
    allData.map(neighbor => {
      var xDiff = Math.abs(datum.x - neighbor.x);
      var yDiff = Math.abs(datum.y - neighbor.y);
      var distance = Math.sqrt(Math.pow(xDiff, 2) + Math.pow(yDiff, 2));
      var currentFartherNeighbor = datum.neighbors.find(dataNeighbor => dataNeighbor[0] > distance);
      datum.neighbors.length < k ?
        datum.neighbors.push([distance, neighbor.identity]) :
        currentFartherNeighbor ?
          datum.neighbors.splice(datum.neighbors.indexOf(currentFartherNeighbor), 1, [distance, neighbor.identity]) :
          false;
      datum.neighbors.sort((a, b) => b[0] - a[0]); // farthest neighbor first
    });
    return datum;
  });
}
```

The above function loops over the input data and compares each datum to every point in the `allData` array, pushing into the neighbors array if there are fewer than `k` neighbors, or if a comparison yields a distance smaller than one currently in the array. In the latter case, the farthest neighbor also needs to be spliced out. The datum is then returned with its neighbors array populated as expected. However, one more component is necessary in order to paint these points on the page. We have reusable functions to do the latter, but before passing data into them to render onto the page, we should consider what the input looks like when this function is invoked, and how to format it so it is successfully accepted by the `populateGraph` function.

```javascript
var classifyAndPlot = (data) => {
  // fit the data to the graph range
  data.map(datum => smooth(datum, max));
  // populate neighbors
  findNeighbors(data, 3);
  var formattedData = data.map(datum => {
    var neighborsIdentity = datum.neighbors.reduce((acc, neighbor) => {
      acc[neighbor[1]] === undefined ? acc[neighbor[1]] = 1 :
        acc[neighbor[1]] = acc[neighbor[1]] + 1;
      return acc;
    }, []);
    var maximum = Math.max(...neighborsIdentity.filter(neighbor => neighbor));
    datum.identity = neighborsIdentity.indexOf(maximum);
    return datum;
  });
  populateGraph(formattedData, false, true);
}
```
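The tallying step inside `classifyAndPlot` is worth isolating: the reduce builds a sparse array indexed by identity, and the index holding the largest count becomes the predicted class. A standalone run with a hypothetical neighbor list (the `|| 0` is a shorthand for the ternary used in the full function):

```javascript
// Majority vote over neighbors stored as [distance, identity] pairs.
var neighbors = [[2.2, 1], [3.1, 0], [1.4, 1]];
var tally = neighbors.reduce((acc, neighbor) => {
  acc[neighbor[1]] = (acc[neighbor[1]] || 0) + 1;
  return acc;
}, []); // tally is [1, 2]: one vote for class 0, two for class 1
var maximum = Math.max(...tally.filter(count => count));
var identity = tally.indexOf(maximum); // 1: the majority of neighbors are high-risk
```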

That's it! We have everything we need to train our algorithm and predict the classification of unlabeled inputs based on present clustering trends. As noted, this is not the most efficient solution to most machine learning problems, but it is useful in cases where there are discrete categories with clear distinctions. As someone interested in medicine and public health, I've learned that cancer diagnosis is one of those cases, in which you are assigned either a positive or negative diagnosis based on various biometric measures related to understood oncological trends. This is the idea the example data is modeled after, but it can be adapted to any training data in which each datum has been classified. What ideas for application come to mind for you? Let me know below in the comments.

Thanks for reading, now go make your own improved implementations!

Jacob
