Snakes surviving thanks to machine learning

There is a big chance that you have heard about Neural Networks and Artificial Intelligence over the previous months; they seem to accomplish wonders, from rating your selfie to understanding your voice (SIRI), beating players at games such as Chess and Go, turning a horse into a zebra, or making you look younger, older, or from the other gender.

Those keywords have in fact been used so many times, in so many different contexts, that the whole "A.I." lexical field has become confusing: at best it correlates with "something that looks smart". Maybe because of that confusion, machine learning seems way too complex, way too hard to grasp, like "I bet there is so much math, this is not for me!!". Well, don't worry, that was me too, so let's go on a journey: I'll tell you everything I learned, some misconceptions I had, how to interpret the results, and some basic vocabulary and fun facts along the way.

What are we talking about?

Imagine a box in which you carve some holes, then you throw into it a predefined amount of numbers, either 0 or 1. The box vibrates violently, and from each hole it spurts out one number: 0 or 1. Great. Now what? Initially, your box is dumb as hell: it won't magically give you the result you expect, you have to train it to achieve the goal you want.

Understanding through history

My biggest mistake was trying to wrap my head around the concepts by just looking at the tip of the iceberg, playing with libraries and getting mad when it didn't work. You can't really afford that with neural nets. Time to go back.

The invention of this wonderful cardboard cheese originated around 1943, when the neurophysiologist Warren Sturgis McCulloch, 45 years old, and his colleague Walter Pitts wrote a paper named: "A Logical Calculus of the Ideas Immanent in Nervous Activity". Pursuing the quest of the classical philosophers of Greece, he attempted to model how the brain works mathematically. That was indeed pretty bright given that we had only known about neurons since ~1900, and that the electric nature of the signals they use to communicate wouldn't be demonstrated before the late 1950s.

Let's take a second here to remember that a published paper does not mean that the person who wrote it was absolutely right. It means this guy had a hypothesis, knew a bit about the field, had some kind of results, sometimes not applicable, and published it. Then other qualified people from the field have to reproduce the results and decide whether or not it's a nice base to build upon. When left unchecked, and toyed with p-hacking, some papers can give birth to aberrations like this 2011 paper saying that people could see into the future.

Nevertheless, we now know that our McCulloch was a good guy, and in his paper he tried to model some algorithms after the physical brain, itself constituted of a nervous system (possessed by most multicellular animals), which is actually a net of neurons. Hoping you won't lose too many of these reading the following. (Source)

The human brain has ~86 billion of these fellas, almost as many as the number of galaxies in the Universe, each having axons and dendrites, with synapses connecting it to ~7000 other neurons. Since we're not good at visualizing big numbers, here is how to see $1B:
Now imagine 86 times that.

The problem for McCulloch was that the economic context wasn't thriving in 1943: we were at war, Franklin D. Roosevelt had frozen prices, salaries and wages to prevent inflation, Nikola Tesla passed away, and the best computer available was the ENIAC, which cost $7 million and weighed 30 tons. (The sexism in tech ran pretty wild(er?) at that time, and the fact that ENIAC was programmed by 6 women caused their male colleagues to wrongly underestimate it.) As a comparison, a standard Samsung flip phone from 2005 had 1300x the ENIAC's computing power.

Then in 1958, computers were doing a bit better, and Frank Rosenblatt, inspired by McCulloch, gifted us with the Perceptron. Everybody was happily digging deeper until 11 years later, when Marvin Minsky decided he did not like that idea and that Frank's Perceptron wasn't cut out for the job, as he explained in a book in which he wrote: "Most of Rosenblatt's writing … is without scientific value…". The impact of this book is that it drained the already low funding in this field. Classic Minsky.

What did Minsky have against McCulloch's padawan?

The linearity conundrum

While this may sound like a Big Bang Theory episode title, it actually represents the basis of Minsky's argument to detract from the original Perceptron. Rosenblatt's perceptron looks very similar to our box, in which this time we drilled a single hole:

A Neural Network can actually take inputs between 0 and 1

If the sum of our input signals (x1…x4) multiplied by their respective weights (w1…w4), plus the bias (b), is enough to make the result go above the threshold (T), our door will liberate the value 1; otherwise, 0. For this to happen, the threshold value is compared to the result of the activation function. Exactly like brain neurons respond to a stimulus: if the stimulus is too low, the neuron doesn't fire the signal along the axon to the dendrites.

Depolarization of a neuron thanks to the magic of the sodium-potassium pump

Code wise it globally looks like this:

// Defining the inputs
const x1 = 1;
const x2 = 0.3;
const x3 = 0.2;
const x4 = 0.5;

// Defining the weights
const w1 = 1.5;
const w2 = 0.2;
const w3 = 1.1;
const w4 = 1.05;

const Threshold = 1;
const bias = 0.3;

// Placeholder activation (identity) so this snippet runs as-is;
// the original Perceptron uses the Heaviside step function (see below)
const activation = (x) => x;

// The value evaluated against the threshold is the sum of the
// inputs multiplied by their weights
// (1*1.5) + (.3*0.2) + (.2*1.1) + (.5*1.05)
const sumInputsWeights = x1*w1 + x2*w2 + x3*w3 + x4*w4; // 2.305

const doorWillOpen = activation(sumInputsWeights + bias) > Threshold; // true

In the human body, a neuron's electrical "off" state is at -70mV and its activation threshold is when it hits -55mV. In the case of the original Perceptron, this activation is handled by the Heaviside step function: 0 if x is negative, 1 if x is null or positive, x being the sumInputsWeights + bias. One of the best-known activation functions is the sigmoid function, f(x) = 1 / (1 + exp(-x)), and the bias is generally used to shift its activation threshold.

Tweaking the activation function can yield adjusted results, altering when the neuron fires

Some activation functions allow negative values as output, some don't. This will prove important when the result of one perceptron's activation function is fed as an input to another perceptron. Using 0 as an input always silences its associated weight, causing its connection to be non-existent in the sum. So instead of having the possibility of weighting it down, it acts like an abstention vote.
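To make those activation functions concrete, here is a minimal sketch (my own illustration, not code from the article or its demo; the names heaviside, sigmoid and fires are mine) of the Heaviside step and the sigmoid, and of how the bias shifts the point at which the neuron starts firing:

// Heaviside step: 0 if x is negative, 1 if x is null or positive
const heaviside = (x) => (x < 0 ? 0 : 1);

// Sigmoid: squashes any value into the ]0, 1[ range
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// The bias shifts where the neuron starts firing
const fires = (sumInputsWeights, bias) => heaviside(sumInputsWeights + bias);

console.log(fires(0.4, -1)); // 0 -> the negative bias keeps the neuron silent
console.log(fires(0.4, 1));  // 1 -> the same weighted sum now fires
console.log(sigmoid(2.605)); // ~0.93 -> close to 1, but never quite reaching it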
Whereas sometimes we might want a 0 input to count as a vote for the opposite candidate, not as an abstention.

Many other activation functions exist.

The Perceptron is known as a binary classifier, meaning it can only classify between 2 options (Spam vs Not Spam, Oranges vs Not-Oranges… etc.). It's also designated as a linear classifier, meaning its goal is to identify to which class an object belongs according to its characteristics (or "features": our x1 to x4), iterating until it finds a SINGLE line that correctly separates the entities of each class. We give our classifier some examples of expected results given some inputs, and it will train itself to find this separation by adjusting the weights assigned to every input, and its bias.

As an example, let's use a perceptron to classify some entities as "Friendly or not" according to 2 characteristics: Teeth and Size.

Depending on the sources, the cat seems very borderline indeed

Now that we trained our perceptron, we can predict samples it's never seen before. Head here for the live demo. No seriously, check it.

What Minsky reproached Rosenblatt for was the following: what if suddenly my training set contained a giant snake, with almost no teeth but quite as big as an elephant? The training set is not separable by one line anymore: it now requires 2 lines to correctly separate the red entities from the green ones. With a single perceptron, this impossibility would cause it to try and run forever, unable to classify with a single straight line.

By this time you should be able to handle the usual representation of a perceptron:

The bias can be applied either after the sum or as an added weight for a fictional input always being 1

Solving the bilinear issue:

One way to solve this is to simply train 2 perceptrons, one responsible for the top-left separation, the other for the bottom-right, and we plug them together with an AND rule, creating some kind of Multiclass perceptron.

const p1 = new Perceptron();
p1.train(trainingSetTopLeft);
p1.learn();
// P1 ready.

const p2 = new Perceptron();
p2.train(trainingSetBottomRight);
p2.learn();
// P2 ready.

const inputs = [x1, x2];
const result = p1.predict(inputs) & p2.predict(inputs);

There are other ways to solve this that we'll explore in part 2.

The AND operation, aka A^B, is part of the logical operations a perceptron can perform: by putting w5 and w6 at around 0.6 and applying a bias of -1 to our sum, we indeed created an AND predictor. Remember that the inputs connected with w5 and w6 are coming from the activation of the Heaviside step function, yielding only 0 or 1 as output.

For 0 & 0: (0*0.6 + 0*0.6) - 1 is < 0, Output: 0
For 0 & 1: (0*0.6 + 1*0.6) - 1 is < 0, Output: 0
For 1 & 1: (1*0.6 + 1*0.6) - 1 is > 0, Output: 1

By doing this chaining, having "elongated" our original perceptron, we have created what is called a hidden layer, which is basically some neurons plugged in between the original inputs and the final output. They should really be called "feature detector layers" or "sub-problem layers".

The worst part is that Rosenblatt had already covered the multi-layer & cross-coupled perceptron in his own 1962 book, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Minsky knew all this and still decided to focus on the simplest version of the perceptron to spit on all of Rosenblatt's work. Which is a bit like publishing a book named Transistors, to express that a single one of them is worthless, while never acknowledging the computer.
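Before moving on, here is the AND trick above as a runnable sketch (my own illustration, reusing the 0.6 / 0.6 / -1 values from the example; the and helper is not part of the article's Perceptron class):

// Heaviside step activation, yielding only 0 or 1
const heaviside = (x) => (x < 0 ? 0 : 1);

// An AND "neuron": both weights at 0.6 and a bias of -1,
// fed with the 0/1 outputs of two other perceptrons
const and = (a, b) => heaviside(a * 0.6 + b * 0.6 - 1);

console.log(and(0, 0)); // (0*0.6 + 0*0.6) - 1 < 0 -> 0
console.log(and(0, 1)); // (0*0.6 + 1*0.6) - 1 < 0 -> 0
console.log(and(1, 0)); // (1*0.6 + 0*0.6) - 1 < 0 -> 0
console.log(and(1, 1)); // (1*0.6 + 1*0.6) - 1 > 0 -> 1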
But where is the perceptron line coming from?

You should all be familiar with the following equations: y = f(x), y = mx + c, or y = ax + b. So how do we find this m and c from our trained perceptron? To figure this out we have to remember the original equation of our perceptron:

x1*w1 + x2*w2 + bias > T

Bias and Threshold are the same concept, and for the perceptron T = 0, which means the equation becomes x1*w1 + x2*w2 > -bias, which can be rewritten as:

x2 > (-w1/w2)*x1 + (-bias/w2)

Comparing this to y = m*x + b, we can see that x2 stands for y, x1 for x, (-w1/w2) for m and (-bias/w2) for b. Which means the gradient (m) of our line is determined by the 2 weights, and the place where the line cuts the vertical axis is determined by the bias and the 2nd weight. We can now simply select two x values (0 and 1), replace them in the line equation to find the corresponding y values, then trace a line between the two points (0; f(0)) and (1; f(1)).

Like a good crime scene, having access to the equation y = (-w1/w2)x + (-bias/w2) allows us to make some observations:

· The gradient (steepness) of the line depends only on the two weights.
· The minus sign in front of w1 means that if both weights have the same sign, the line will slope down like a \, and if their signs differ it will slope up like a /.
· Adjusting w1 will affect the steepness but not where the line intersects the vertical axis; w2 instead will have an effect on both.
· Because the bias only appears in the intercept (-bias/w2), changing it moves the line up or down the vertical axis without changing its steepness (in which direction depends on the signs of the bias and w2).

You can check the final weights in the console of the demo by entering app.perceptron.weights. And that is exactly what our perceptron tries to optimize until it finds the correct line.

I know it might look like 2 lines but it's really one moving super fast

What do you mean by "optimize"?

As you can guess, our perceptron can't blindly guess and try every value for its weights until it finds a line that correctly separates the entities. What we have to apply is the delta rule. It's a learning rule for updating the weights associated with the inputs in a single-layer neural network, and it is represented by a formula that looks scary (oh god, math!), but don't worry, it can be "simplified" like this:

Δwᵢ = α * (expected − actual) * xᵢ

We do this for every item in the training set. expected − actual represents an error value (or cost). The goal is to iterate through the training set and reduce this error/cost to the minimum, by adding or subtracting a small amount to the weights of each input until it validates all the training set expectations. If after some iterations the error is 0 for every item in the training set, our perceptron is trained, and our line equation using these final weights correctly separates the entities in two.

Beware of the learning rate

In the equation above, α represents a constant: the learning rate, which controls how much each weight gets altered.

· If α is too small, it will require more iterations than needed to find the correct weights, and you might get trapped in a local minimum.
· If α is too big, the learning might never find some correct weights.

One way to see it is to imagine a poor guy with his metal boots tied together who wants to reach a treasure at the bottom of a cliff, but can only move by jumping α meters:

α too big vs α too small

One thing you might want to do when reading an article on Wikipedia is to head to the "Talk" section, which discusses disputed areas of the content.
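To tie the delta rule and the line equation together, here is a minimal training sketch (my own illustration, assuming a tiny hand-made training set; the names alpha, weights and predict are mine, and this is not the code behind the demo): a single perceptron adjusts its two weights and its bias with the delta rule until the error is 0 for every item, then we read the separating line back from the learned weights.

// Heaviside step activation, as in the original Perceptron
const heaviside = (x) => (x < 0 ? 0 : 1);

// Tiny training set: [x1, x2, expected]. Linearly separable on purpose.
const trainingSet = [
  [0, 0, 0],
  [0, 1, 0],
  [1, 0, 0],
  [1, 1, 1],
];

const alpha = 0.1;        // the learning rate
const weights = [0, 0];   // w1, w2
let bias = 0;

const predict = ([x1, x2]) =>
  heaviside(x1 * weights[0] + x2 * weights[1] + bias);

// Delta rule: w_i += alpha * (expected - actual) * x_i,
// repeated over the training set until every prediction is correct
for (let epoch = 0; epoch < 100; epoch++) {
  let errors = 0;
  for (const [x1, x2, expected] of trainingSet) {
    const error = expected - predict([x1, x2]); // -1, 0 or 1
    if (error !== 0) errors++;
    weights[0] += alpha * error * x1;
    weights[1] += alpha * error * x2;
    bias += alpha * error; // the bias is updated like a weight on a constant input of 1
  }
  if (errors === 0) break; // error is 0 for the whole set: the perceptron is trained
}

// Reading the separating line back from the weights, as derived above:
// x2 = (-w1/w2) * x1 + (-bias/w2)
const m = -weights[0] / weights[1];
const b = -bias / weights[1];
console.log(`Decision boundary: x2 = ${m.toFixed(2)} * x1 + ${b.toFixed(2)}`);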
In the case of the delta formula, the content said that it couldn't be applied to the perceptron because the Heaviside derivative does not exist at 0, but the Talk section provided articles from M.I.T. teachers using it anyway.

By putting together everything we learned, we can finally code a perceptron (the live demo linked above is one such implementation). Additionally, the Perceptron belongs to the feedforward neural network category, which is just fancy wording for saying that the connections between the units do not form a cycle.

To cut Minsky some slack: while the perceptron algorithm is guaranteed to converge on some solution in the case of a linearly separable training set, it may still pick any valid solution, and some problems admit many solutions of varying quality. Your brain does a bit of the same: as soon as it finds a correct way to talk to a muscle (i.e. the muscle responds correctly), it settles for it. This brain-to-muscle code is different for everyone!

Intellectuals like to argue with each other all the time, like Minsky and Rosenblatt did. Even Einstein was proven wrong in his fight against Niels Bohr on quantum indeterminism, arguing with his famous quote that "God does not play dice with the universe".

We'll end with some poetry from the father of neural networks, our dear Warren McCulloch (he really was a poet). I hope you learned some things, and I will see you soon for Part 2.

Our knowledge of the world, including ourselves, is incomplete as to space and indefinite as to time. This ignorance, implicit in all our brains, is the counterpart of the abstraction which renders our knowledge useful.