There is a good chance you have heard about neural networks and artificial intelligence over the past few months. They seem to accomplish wonders: rating your selfie, letting Siri understand your voice, beating players at games such as chess and Go, turning a horse into a zebra, or making you look younger, older, or like another gender.
Those keywords have in fact been used so many times in so many different contexts that the whole “A.I.” lexical field has become confusing; at best it correlates with “something that looks smart”.
Maybe because of that confusion, machine learning seems way too complex, way too hard to grasp, like “I bet there is so much math, this is not for me!!”.
Well, don’t worry, I felt the same, so let’s go on a journey: I’ll tell you everything I learned, some misconceptions I had, how to interpret the results, and some basic vocabulary and fun facts along the way.
What are we talking about?
Imagine a box in which you carve some holes, then you throw into it a predefined amount of numbers, each either 0 or 1. Then the box vibrates violently and from each hole the box spurts out one number, again 0 or 1.
Initially, your box is dumb as hell, it won’t magically give you the result you expect, you have to train it to achieve the goal you want.
Understanding through history
My biggest mistake was trying to wrap my head around concepts by just looking at the tip of the iceberg, playing with libraries and getting mad when it didn’t work. You can’t really afford that with neural nets.
The invention of this wonderful cardboard cheese dates back to around 1943, when the neurophysiologist Warren Sturgis McCulloch, then 45 years old, and his colleague Walter Pitts wrote a paper named “A Logical Calculus of the Ideas Immanent in Nervous Activity”.
Pursuing the quest of the classical philosophers of Greece, he attempted to model how the brain works mathematically.
That was indeed pretty bright, given how recently we had learned about neurons (~1900) and that the electric nature of the signals they use to communicate wouldn’t be demonstrated before the late 1950s.
Let’s take a second here to remember that a published paper does not mean that the person who wrote it was absolutely right. It means this person had a hypothesis, knew a bit about the field, had some kind of results (sometimes not applicable) and published them.
Then other qualified people from the field have to try it, reproduce the results and decide whether or not it’s a nice base to build upon.
When left unchecked, and toyed with through p-hacking, some papers can give birth to aberrations like this 2011 paper saying that people could see into the future.
Nevertheless, we now know that our McCulloch was a good guy. In his paper he tried to model some algorithms after the physical brain, itself part of a nervous system (possessed by most multicellular animals), which is actually a net of neurons.
The human brain has ~86 billion of these fellas, each having axons, dendrites and synapses that connect it to ~7000 others, which is almost as many connections as the number of galaxies in the Universe.
Since we’re not good at visualizing big numbers, here is how to see $1B:
The problem for McCulloch was that the economic context wasn’t thriving in 1943: the world was at war, Franklin D. Roosevelt froze prices, salaries and wages to prevent inflation, Nikola Tesla passed away, and the best computer in the works was the ENIAC, which cost $7 million and weighed 30 tons. (Sexism in tech ran pretty wild(er?) at that time, and the fact that ENIAC was programmed by 6 women caused their male colleagues to wrongly underestimate it.)
As a comparison, a standard Samsung flip phone from 2005 had 1300x the ENIAC computing power.
Then in 1958, computers were doing a bit better, and Frank Rosenblatt, inspired by McCulloch, gifted us with the Perceptron.
Everybody was happy digging deeper until Marvin Minsky, 11 years later, decided he did not like that idea and that Frank Rosenblatt wasn’t cut out for the job. He explained as much by publishing a book in which he said: “Most of Rosenblatt writing … is without scientific value…”. The impact of this book was that it drained the already low funding in this field.
What did Minsky have against McCulloch’s padawan?
The linearity conundrum
While this may sound like a Big Bang Theory episode title, it actually represents the basis of Minsky’s argument against the original Perceptron.
Rosenblatt’s perceptron looks very similar to our box, in which this time we drilled a single hole:
If the sum of our input signals (x1…x4) multiplied by their respective weights (w1…w4), plus the bias (b), is enough to push the result above the threshold (T), our door will liberate the value 1.
For this to happen the threshold value is compared to the result of the activation function, much like brain neurons respond to a stimulus: if the stimulus is too low, the neuron doesn’t fire the signal along the axon to the dendrites.
Code-wise, it roughly looks like this:
// Defining the inputs
const x1 = 1;
const x2 = 0.3;
const x3 = 0.2;
const x4 = 0.5;
// Defining the weights
const w1 = 1.5;
const w2 = 0.2;
const w3 = 1.1;
const w4 = 1.05;
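const threshold = 1;
const bias = 0.3;
// The value evaluated against the threshold is the sum of the
// inputs multiplied by their weights, plus the bias
const sumInputsWeights = x1*w1 + x2*w2 + x3*w3 + x4*w4; // ≈ 2.305
// Assumed for this sketch: an identity activation, so the raw
// value is compared to the threshold
const activation = (x) => x;
const doorWillOpen = activation(sumInputsWeights + bias) > threshold; // true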
In the human body, a neuron’s electrical off state is at -70mV and its activation threshold is reached when it hits -55mV.
In the case of the original Perceptron, this activation is handled by the Heaviside step function.
One of the best-known activation functions is the sigmoid function:
f(x) = 1 / (1 + exp(-x)) and the bias is generally used to shift its activation threshold:
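To make the bias shift a bit more concrete, here is a small sketch of both activations. The helper name withBias and the choice of returning 0 at exactly 0 for the step function are assumptions made for this example, not something the original Perceptron paper dictates.

```javascript
// Heaviside step: outputs 1 when the input is positive, 0 otherwise.
// (Convention assumed here: the value at exactly 0 is 0.)
const heaviside = (x) => (x > 0 ? 1 : 0);

// Sigmoid: squashes any real number into the (0, 1) range.
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// Hypothetical helper: adding a bias to the input shifts
// the point where the activation "switches on".
const withBias = (activation, bias) => (x) => activation(x + bias);

const shiftedStep = withBias(heaviside, -2);    // now fires only when x > 2
const shiftedSigmoid = withBias(sigmoid, -2);   // midpoint (0.5) moves to x = 2

console.log(heaviside(1), shiftedStep(1), shiftedStep(3)); // 1 0 1
console.log(sigmoid(0), shiftedSigmoid(2));                // 0.5 0.5
```

A negative bias makes the neuron harder to trigger, a positive one makes it easier.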
Some activation functions allow negative values as output, some don’t. This will prove important when feeding the result of one perceptron’s activation function as an input to another perceptron.
0 as an input always silences its associated weight, causing its connection to be non-existent in the sum.
So instead of having the possibility of weighting it down, it acts like an abstention vote, whereas sometimes we might want 0 as an input to be a vote for the opposite candidate.
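Here is a tiny sketch of that abstention effect. The weight value and the toBipolar remapping (0 becomes -1, 1 stays 1) are assumptions picked for illustration; it is just one possible way to let a 0 vote against.

```javascript
const weight = 0.8; // arbitrary weight, for illustration only

// With 0/1 inputs, a 0 contributes nothing to the sum: an abstention.
const contribution01 = [0, 1].map((x) => x * weight);

// Hypothetical remapping: 0 -> -1 and 1 -> +1,
// so a "0" now actively pulls the sum down.
const toBipolar = (x) => 2 * x - 1;
const contributionBipolar = [0, 1].map((x) => toBipolar(x) * weight);

console.log(contribution01);      // [ 0, 0.8 ]
console.log(contributionBipolar); // [ -0.8, 0.8 ]
```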
The Perceptron is known as a binary classifier, meaning it can only classify between 2 options (Spam vs Not Spam, Oranges vs Not-Oranges… etc).
It’s also designated as a linear classifier, meaning its goal is to identify which class an object belongs to according to its characteristics (or “features”: our x1 to x4) by iterating until it finds a SINGLE line that correctly separates the entities of each class.
We give our classifier some examples of expected results given some inputs, and it will train itself to find this separation, by adjusting the weights assigned to every input and its bias.
As an example let’s classify some entities between “Friendly or not” according to 2 characteristics: Teeth and Size using a perceptron
Now that we trained our perceptron, we can predict samples it’s never seen before:
What Minsky held against Rosenblatt was the following: what if my training set suddenly contained a giant snake, with almost no teeth but about as big as an elephant?
It now requires 2 lines to correctly separate the red entities from the green ones. A single perceptron would try and run forever, unable to classify them with a single straight line.
By this time you should be able to handle the usual representation of a perceptron:
Solving the bilinear issue:
One way to solve this is to simply train 2 perceptrons, one responsible for the top left separation, the other for the bottom right, and plug them together with an AND rule, creating some kind of multi-layer perceptron.
const p1 = new Perceptron();
p1.learn(); // P1 ready.
const p2 = new Perceptron();
p2.learn(); // P2 ready.
const inputs = [x1,x2];
const result = p1.predict(inputs) & p2.predict(inputs);
The AND operation, aka A∧B, is part of the logical operations a perceptron can perform: by putting w5 and w6 at around 0.6 and applying a bias of -1 to our sum, we indeed created an AND predictor.
Remember that the inputs connected with w5 and w6 are coming from the activation of the Heaviside step function, yielding only 0 or 1 as output.
For 0 & 0 : (0*0.6 + 0*0.6)-1 is <0, Output: 0
For 0 & 1 : (0*0.6 + 1*0.6)-1 is <0, Output: 0
For 1 & 0 : (1*0.6 + 0*0.6)-1 is <0, Output: 0
For 1 & 1 : (1*0.6 + 1*0.6)-1 is >0, Output: 1
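The same truth table can be checked with a few lines of code. This is only a sketch: the function name andGate is made up, and the step function is assumed to output 0 when the sum is exactly 0.

```javascript
// Heaviside step, assumed to return 0 at exactly 0.
const heaviside = (x) => (x > 0 ? 1 : 0);

// Weights and bias taken from the text: w5 = w6 = 0.6, bias = -1.
const andGate = (a, b) => heaviside(a * 0.6 + b * 0.6 - 1);

for (const [a, b] of [[0, 0], [0, 1], [1, 0], [1, 1]]) {
  console.log(`${a} & ${b} -> ${andGate(a, b)}`);
}
// 0 & 0 -> 0
// 0 & 1 -> 0
// 1 & 0 -> 0
// 1 & 1 -> 1
```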
By doing this chaining and having “elongated” our original perceptron, we have created what is called a hidden layer, which is basically some neurons plugged in between the original inputs and the final output.
The worst part is that Minsky knew all this and still decided to focus on the simplest version of the perceptron to spit on all of Rosenblatt’s work. Rosenblatt had already covered the multi-layer & cross-coupled perceptron in his own book Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms in 1962.
Which is a bit like publishing a book named Transistors, stating that a single one of them is worthless, and never acknowledging the computer.
But where is the perceptron line coming from?
You should all be familiar with the following equations:
y = mx + c
y = ax + b
So how do we find this m and c from our trained perceptron?
To figure this out we have to remember the original equation of our perceptron:
x1*w1 + x2*w2 + bias > T
Bias and threshold play the same role, and for the perceptron T = 0, which means the equation becomes
x1*w1 + x2*w2 > -bias
which can be rewritten (for a positive w2) as:
x2 > (-w1/w2)*x1 + (-bias/w2)
Comparing this to
y = m*x + b
we can see that y is x2, x is x1, m = -w1/w2 and b = -bias/w2.
Which means the gradient (m) of our line is determined by the 2 weights, and the place where the line cuts the vertical axis is determined by the bias and the 2nd weight.
We can now simply select two x values (0 and 1), replace them in the line equation and find the corresponding y values, then trace a line between the two points.
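As a quick sketch of that last step, here is how the two points could be computed from a set of weights. The values of w1, w2 and bias below are made up for illustration; they are not the weights of the demo.

```javascript
// Hypothetical trained values, chosen only for illustration.
const w1 = 0.5;
const w2 = 0.25;
const bias = -1;

const slope = -w1 / w2;       // m = -2
const intercept = -bias / w2; // b = 4
const lineY = (x) => slope * x + intercept;

// Two points are enough to trace the line.
const p0 = { x: 0, y: lineY(0) };
const p1 = { x: 1, y: lineY(1) };
console.log(p0, p1); // { x: 0, y: 4 } { x: 1, y: 2 }
```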
Like at a good crime scene, having access to the equation allows us to make some observations:
y = (-w1/w2)x + (-bias/w2)
· The gradient (steepness) of the line depends only on the two weights
· The minus sign in front of w1 means that if both weights have the same sign, the line will slope down like a \, and if their signs differ it will slope up like a /
· Adjusting w1 will affect the steepness but not where it intersects the vertical axis; w2 instead will have an effect on both
· Because the bias is the numerator (top half of the fraction) and carries a minus sign, increasing it moves the point where the line cuts the vertical axis: lower when w2 is positive, higher when w2 is negative.
You can check the final weights in the console of the demo by entering
And that is exactly what our perceptron tries to optimize until it finds the correct line.
What do you mean by “optimize” ?
As you can guess, our perceptron can’t blindly guess and try every value for its weights until it finds a line that correctly separates the entities. What we have to apply is the delta rule.
It’s a learning rule for updating the weights associated with the inputs in a single layer neural network, and is represented by the following:
Don’t worry it can be “simplified” like this:
expected - actual represents an error value (or cost). The goal is to iterate through the training set and reduce this error/cost to the minimum, by adding or subtracting a small amount to the weights of each input until it validates all the training set expectations.
If after some iterations the error is 0 for every item in the training set, our perceptron is trained, and our line equation using these final weights correctly separates the entities in two.
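As a rough sketch, assuming the simplified rule is the usual perceptron update (each weight moves by α × (expected - actual) × its input, and the bias by α × (expected - actual)), one update on a single training sample could look like this. The function and variable names are made up for the example.

```javascript
// Heaviside step, assumed to return 0 at exactly 0.
const heaviside = (x) => (x > 0 ? 1 : 0);

// One delta-rule update for a single training sample.
// Assumption: actual = heaviside(sum of inputs * weights + bias).
function updateOnce(weights, bias, inputs, expected, learningRate) {
  const sum = inputs.reduce((acc, x, i) => acc + x * weights[i], bias);
  const actual = heaviside(sum);
  const error = expected - actual; // -1, 0 or 1
  return {
    weights: weights.map((w, i) => w + learningRate * error * inputs[i]),
    bias: bias + learningRate * error,
    error,
  };
}

// The sample is misclassified (actual 0, expected 1), so the weights grow.
const step = updateOnce([0, 0], 0, [1, 2], 1, 0.5);
console.log(step); // { weights: [ 0.5, 1 ], bias: 0.5, error: 1 }
```

When the prediction is already right, the error is 0 and nothing moves, which is why training stops once every sample is classified correctly.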
Beware of the learning rate
In the equation above, α represents a constant: the learning rate, that will have an impact on how much each weight gets altered.
· If α is too small, it will require more iterations than needed to find the correct weights and you might get trapped in a local minimum.
· If α is too big the learning might never find some correct weights.
One way to see it is imagining a poor guy with metal boots tied together that wants to reach a treasure at the bottom of a cliff but he can only move by jumping by α meters:
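The jumping guy can also be sketched numerically. This is a toy stand-in, not the perceptron itself: the cliff is assumed to be the curve f(x) = x², the treasure sits at x = 0, and each jump is α times the slope (2x). The function name and its default values are made up for the example.

```javascript
// Toy cost f(x) = x^2, whose slope is 2x and whose minimum is at x = 0.
// Returns the number of jumps needed to get within `tolerance` of the
// minimum, or null if it is still not reached after `maxSteps` jumps.
function stepsToMinimum(learningRate, start = 10, tolerance = 0.01, maxSteps = 1000) {
  let x = start;
  for (let step = 0; step < maxSteps; step++) {
    if (Math.abs(x) < tolerance) return step;
    x = x - learningRate * 2 * x;
  }
  return null;
}

const tinyJumps = stepsToMinimum(0.01); // gets there, but takes hundreds of jumps
const goodJumps = stepsToMinimum(0.4);  // gets there in 5 jumps
const hugeJumps = stepsToMinimum(1.1);  // overshoots further each time: null

console.log(tinyJumps, goodJumps, hugeJumps);
```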
One thing you might want to do when reading an article on Wikipedia is to head to the “Talk” section, which discusses disputed areas of the content.
In the case of the delta formula, the content said that it couldn’t be applied to the perceptron because the Heaviside derivative does not exist at 0, but the Talk section provided articles by M.I.T. teachers using it.
By putting together everything we learned, we can finally code a perceptron:
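What follows is a minimal sketch rather than the code of the demo: it assumes a Heaviside step activation and the simplified delta rule, and the constructor arguments, the maxEpochs safety cap and the training data are all made up for the example.

```javascript
// Minimal perceptron sketch: Heaviside activation + delta rule.
class Perceptron {
  constructor(inputCount, learningRate = 0.1) {
    this.weights = new Array(inputCount).fill(0);
    this.bias = 0;
    this.learningRate = learningRate;
  }

  // Weighted sum plus bias, passed through the step function.
  predict(inputs) {
    const sum = inputs.reduce((acc, x, i) => acc + x * this.weights[i], this.bias);
    return sum > 0 ? 1 : 0;
  }

  // Loops over the training set until every sample is classified correctly,
  // or gives up after maxEpochs (a set that is not linearly separable
  // would otherwise loop forever). Returns true when training succeeded.
  learn(samples, maxEpochs = 1000) {
    for (let epoch = 0; epoch < maxEpochs; epoch++) {
      let mistakes = 0;
      for (const { inputs, expected } of samples) {
        const error = expected - this.predict(inputs);
        if (error !== 0) {
          mistakes++;
          this.weights = this.weights.map(
            (w, i) => w + this.learningRate * error * inputs[i]
          );
          this.bias += this.learningRate * error;
        }
      }
      if (mistakes === 0) return true;
    }
    return false;
  }
}

// Made-up training data: [teeth, size] -> 1 (friendly) or 0 (not friendly).
const samples = [
  { inputs: [0.1, 0.2], expected: 1 },
  { inputs: [0.2, 0.1], expected: 1 },
  { inputs: [0.9, 0.8], expected: 0 },
  { inputs: [0.8, 0.9], expected: 0 },
];

const perceptron = new Perceptron(2);
const trained = perceptron.learn(samples);
console.log(trained);                          // true
console.log(perceptron.predict([0.15, 0.15])); // 1 (friendly)
console.log(perceptron.predict([0.85, 0.85])); // 0 (not friendly)
```

The maxEpochs cap is there because of the giant snake problem: without it, a training set that a single line cannot separate would keep the loop running forever.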
Additionally, the Perceptron falls into the feedforward neural network category, which is just fancy wording for saying that the connections between the units do not form a cycle.
To cut Minsky some slack, while the perceptron algorithm is guaranteed to converge on some solution in the case of a linearly separable training set, it may still pick any valid solution, and some problems may admit many solutions of varying quality.
Your brain does a bit of the same: as soon as it finds a correct way to talk to a muscle (i.e. the muscle responds correctly), it will settle for it. This brain-to-muscle code is different for everyone!
Intellectuals like to argue with each other all the time, like Minsky and Rosenblatt did. Even Einstein was proven wrong in his fight against Niels Bohr on quantum indeterminism, in which he argued with his famous quote: “God does not play dice with the universe”.
We’ll end with some poetry from the father of neural networks, our dear Warren McCulloch (he really was a poet).
I hope you learned some things and I will see you soon for Part 2.
Our knowledge of the world, including ourselves, is incomplete as to space and indefinite as to time. This ignorance, implicit in all our brains, is the counterpart of the abstraction which renders our knowledge useful.