The Full Story behind Convolutional Neural Networks and the Math Behind it

by Joel Barmettler (@joelbarmettlerUZH), October 15th, 2019

Too Long; Didn't Read

Convolutional Neural Networks (CNNs) became popular after 2010 because they outperformed any other network architecture on visual data. In this article, I aim to explain in detail how researchers came up with the idea of CNNs, how they are structured, how the math behind them works and what techniques are applied to improve their performance. CNNs are inspired by the biology of the human visual system. They use the concept of combining low-level features in the image into higher and higher-level features, until we have cells that react to very specific things: fur, eyes, cat ears, cat eyes etc.


Convolutional Neural Networks (CNNs) became really popular after 2010 because they outperformed any other network architecture on visual data, but the concept behind CNNs is not new. In fact, it is very much inspired by the human visual system. In this article, I aim to explain in detail how researchers came up with the idea of CNNs, how they are structured, how the math behind them works and what techniques are applied to improve their performance.

The biology behind it

The whole idea of Convolutional Neural Networks is inspired by the biology of the eye. While we as humans perceive a visual image as a detailed, colored image of the world around us, there is actually quite a lot of processing done in our brain to get to this point. In the retina, at the back of our eye, we have quite primitive photoreceptors that only react to the amount of light falling on them. Some of them are wavelength selective, making them also responsive to color, others only react to the light intensity. More important for us is how these photoreceptors are structured in the eye. Interestingly, photoreceptors are wired together, forming receptive fields: a group of photoreceptors reacting to light is surrounded by a group of photoreceptors reacting to darkness. All these photoreceptors are then connected to a ganglion cell. The ganglion cell itself only reacts when the majority of the photoreceptors connected to it become active. The ganglion cell will therefore be highly reactive to round regions of contrast.

Later in the brain, in regions such as the LGN or the primary visual cortex, the ganglion cells line up and form straight lines. These aligned ganglion cells are again wired together and connected to a cell called a "simple cell". Just like before, the simple cell only becomes reactive when the ganglion cells connected to it are all active. Since the ganglion cells detect round contrasts and are now aligned next to each other, the simple cells detect - you guessed it - linear contrasts.

We can now play this game all day long: the higher we go into the brain, the more concrete the features detected by the cells. By combining cells that can perceive more and more complex contrast patterns, the brain is able to form cells that react to very specific visual stimulation, like cells that respond when we see cats or dogs.

What researchers did with Convolutional Neural Networks is exactly the same: CNNs try to use this concept of combining low-level features in the image into higher and higher-level features, until we have cells that react to very specific things: fur, eyes, cat ears etc. Then, we use a classic neural network to combine these features into a meaningful context: two ears, fur and two eyes will with a high probability be a cat. You get the idea.

With our Convolutional Neural Network, we build exactly this process. The convolutional layers consist of image filters that extract patterns from the image. What information these filters extract is learned, just like in the brain. When we train a convolutional layer, we try to generate the best possible filters, i.e. the filters that extract the most meaningful information.

Understanding image filters

Now, the best way to explain a convolutional filter more technically is to imagine a flashlight that is shining over the top left of our image with 500x500 pixels. Let’s say that the light this flashlight shines covers a 5x5 area, i.e. we have a filter of size 5x5. And now, let’s imagine this flashlight sliding across all the areas of the input image. The flashlight is the filter (or kernel) and the region that it is shining over is called the receptive field of our convolutional neuron. This filter is just an array of numbers (the numbers are called weights or parameters). Let’s take the first position of the filter as an example: the top left corner. As the filter is sliding, or convolving, around the input image, it multiplies the values in the filter with the original pixel values of the image (i.e. it computes element-wise multiplications). These multiplications are all summed up, so now you have a single number. Remember, this number only represents the filter sitting at the top left of the image. Now, we repeat this process for every possible location on the input volume. (The next step would be moving the filter to the right by 1 pixel, then right again by 1, and so on.) Every unique location on the input volume produces a number. After sliding the filter over all the locations, you are left with a 496x496x1 array of numbers (1 since it is greyscale), which we call an activation map or feature map. The reason you get a 496x496 array is that a 5x5 filter fits at 500 - 5 + 1 = 496 positions along each axis, so there are 496 * 496 = 246,016 different locations on a 500x500 input image. These 246,016 numbers are mapped to a 496x496 array.
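The sliding-flashlight procedure can be sketched in a few lines of NumPy. This is only an illustration: the function name `convolve_valid` is made up, and the image and filter values are random stand-ins rather than real pixels or learned weights. Like most CNN libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]    # receptive field at (r, c)
            out[r, c] = np.sum(patch * kernel)   # element-wise multiply, then sum
    return out

rng = np.random.default_rng(0)
image = rng.random((500, 500))   # stand-in for a greyscale image
kernel = rng.random((5, 5))      # stand-in for the filter weights

activation_map = convolve_valid(image, kernel)
print(activation_map.shape, activation_map.size)
```

The printed shape shows one output value per position at which the 5x5 filter fits fully inside the image.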

The following animation visualizes how such a kernel traverses an image, producing a new, filtered image. The kernel from the example has a size of (3x3) pixels with the values [[0 -1 0][-1 5 -1][0 -1 0]]. The filter from the animation, which I got from Wikipedia, is used to sharpen an image. Note that there is an important difference between our filtering and the filtering technique in the animation: how to deal with border pixels. In the animation, the filter starts at the outermost pixel at position (0, 0). This has the consequence that the top-left part of the filter actually has no pixels to process. The way we deal with this issue is by starting not at the outermost pixel at coordinate (0, 0) but at the pixel at location (1, 1), such that a (3x3)-filter lies fully inside the image at all times. The animation uses a technique called mirroring: for the parts of the filter that lie outside of the original image, we act as if there was a mirrored version of the image. The way we use filters, the output image is slightly smaller than the input image, while the way the animation uses the filter preserves the image size. This is often done by image processing tools like Gimp or Photoshop: you usually don’t want a smaller image when you apply a sharpening or blurring filter.
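To make the two border strategies concrete, here is a small sketch that applies the sharpening kernel once without padding and once with mirrored borders. The helper `filter_image` and the tiny 6x6 test image are made up for illustration, and `np.pad` with `mode="symmetric"` is just one possible way to mirror; the animation may mirror slightly differently.

```python
import numpy as np

sharpen = np.array([[0, -1, 0],
                    [-1, 5, -1],
                    [0, -1, 0]])

def filter_image(image, kernel, mirror=False):
    """Apply `kernel`; with mirror=True the borders are mirrored first."""
    if mirror:
        pad = kernel.shape[0] // 2
        image = np.pad(image, pad, mode="symmetric")  # mirrored copy of the border
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
smaller = filter_image(image, sharpen)              # filter stays inside the image
same_size = filter_image(image, sharpen, mirror=True)
print(smaller.shape, same_size.shape)               # (4, 4) (6, 6)
```

Away from the border both variants agree; they only differ in whether the outermost ring of pixels gets an output value.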

Each of these filters can be thought of as feature identifiers. When I say features, I’m talking about things like straight edges, simple colors, and curves. Think about the simplest characteristics that all images have in common with each other. Let’s say our first filter is 7x7 and is going to be a curve detector. As a curve detector, the filter will have a pixel structure in which there will be higher numerical values along the area that is a shape of a curve.

Now, let’s go back to visualizing this mathematically. When we have this filter at the top left corner of the input volume, it is computing multiplications between the filter and pixel values at that region. Now let’s take an example of an image that we want to classify, and let’s put our filter at the top left corner.

Basically, in the input image, if there is a shape that generally resembles the curve that this filter is representing, then all of the multiplications summed together will result in a large value! You can even see it by eye: the filter and the image region we look at overlap considerably, implying that the filter sees its features very well. When we move the filter to another position that does not contain the structure our filter is looking for, we generate a low response.

This is because there wasn’t much in the image section that responded to the curve detector filter. Remember, the output of this conv layer is an activation map. So, in the simple case of a one filter convolution (and if that filter is a curve detector), the activation map will show the areas in which there are most likely to be curves in the picture. In this example, the top left value of our 494x494 activation map (494x494 because of the 7x7 filter instead of 5x5) will be 6600. This high value means that it is likely that there is some sort of curve in the input volume that caused the filter to activate. The bottom right value in our activation map will be 0 because there wasn’t anything in the input volume that caused the filter to activate (or more simply said, there wasn’t a curve in that region of the original image). Remember, this is just for one filter. This is just a filter that is going to detect lines that curve outward and to the right. We can have other filters for lines that curve to the left or for straight edges. The more filters, the greater the depth of the activation map, and the more information we have about the input volume. In the following image, you see a visualization of what different filters might be looking for.

Low-level feature detection

Convolutional layers therefore detect low level features such as edges and curves. As one would imagine, in order to predict whether an image is a type of object, we need the network to be able to recognize higher level features such as hands or paws or ears. So let’s think about what the output of the network is after the first conv layer. It would be a 496x496x3 volume (assuming we use three 5x5 filters), i.e. one 496x496 activation map per filter. When we go through another conv layer, the output of the first conv layer becomes the input of the 2nd conv layer. Now, this is a little bit harder to visualize. When we were talking about the first layer, the input was just the original image. However, when we’re talking about the 2nd conv layer, the input is the activation map(s) that result from the first layer. So each layer of the input is basically describing the locations in the original image where certain low level features appear. Now when you apply a set of filters on top of that (pass it through the 2nd conv layer), the output will be activations that represent higher level features. Types of these features could be semicircles (combination of a curve and straight edge) or squares (combination of several straight edges). As you go through the network and go through more conv layers, you get activation maps that represent more and more complex features. By the end of the network, you may have some filters that activate when there is handwriting in the image, filters that activate when they see pink objects, etc.

High-level feature detection

Now that we can detect these high level features, we attach a fully connected layer to the end of the network. This layer basically takes an input volume (whatever the output of the last convolutional layer happens to be) and outputs an N dimensional vector where N is the number of classes that the program has to choose from. Each number in this N dimensional vector represents the probability of one of our N classes. For example, if the resulting vector for a binary (N=2) classification program is [0.15 0.85], then this represents a 15% probability that the image is class A but a higher 85% probability that the image is class B. The way this fully connected layer works is that it looks at the output of the previous layer (which as we remember should represent the activation maps of high level features) and determines which features most correlate to a particular class. Basically, a fully connected layer looks at what high level features most strongly correlate to a particular class and has particular weights so that when you compute the products between the weights and the previous layer, you get the correct probabilities for the different classes.

So far, you have already stumbled upon the most important parameters you can set when defining the CNN:
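As a rough sketch of how such a layer turns features into class probabilities, the snippet below multiplies a flattened feature vector with one weight row per class and normalizes the scores with a softmax. The softmax is an assumption on my part (it is the usual choice for this step), and the feature values and weights are random placeholders, not trained values.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(1)
features = rng.random(8)           # flattened high-level feature activations (made up)
W = rng.normal(size=(2, 8))        # one row of weights per class, N = 2
b = np.zeros(2)                    # one bias per class

probabilities = softmax(W @ features + b)
print(probabilities, probabilities.sum())
```

Whatever the weights are, the entries of `probabilities` are non-negative and add up to 1, which is what lets us read them as class probabilities.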

  • How many convolutional layers we want
  • How many filters we want to train for each convolutional layer
  • How large these filters shall be
  • How many fully connected decision layers we would like
  • How many neurons each layer shall contain


While reading the part about convolutional layers, you may have noticed that we create quite a lot of images by running our input image through many different filters. Even though the image size decreases slightly with each convolutional layer, we generate considerably more data with each convolutional layer we add, since every filter produces its own output image. To combat this issue, we use a process called pooling: after filtering an image, we reduce its size drastically by merging a pixel neighborhood into one single value. Most prominently, we use MaxPooling, meaning we take the maximum pixel value of a pixel neighborhood, but we could also use other methods like MinPooling, AvgPooling or MedianPooling. The idea is always the same: take a neighborhood of pixels, combine the values by some function (min, max, avg etc.) and create a single pixel out of these values. While this seems like a filter, it is quite different: a filter moves over our image sequentially, taking a pixel neighborhood as input and producing a single pixel as output, and is shifted by one pixel only, therefore not reducing the image size by much. With pooling, we shift the window by multiple pixels such that each pixel is seen by exactly one window position. We can choose the size of our pooling layer: a (2x2)-pooling region reduces the size of our image by a factor of 4, a (3x3)-pooling layer by a factor of 9 and so on.
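A minimal NumPy sketch of this idea could look as follows. The function name `pool` and the 4x4 example values are made up; the reshape trick simply cuts the image into non-overlapping blocks and reduces each block to one value.

```python
import numpy as np

def pool(image, size=2, reduce=np.max):
    """Reduce each non-overlapping size x size block to a single value."""
    h, w = image.shape
    h, w = h - h % size, w - w % size            # drop rows/columns that do not fill a block
    blocks = image[:h, :w].reshape(h // size, size, w // size, size)
    return reduce(blocks, axis=(1, 3))           # e.g. np.max, np.min, np.mean, np.median

image = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 1],
                  [0, 0, 5, 6],
                  [1, 2, 7, 8]])

print(pool(image))                   # max pooling: [[4 2] [2 8]]
print(pool(image, reduce=np.mean))   # average pooling on the same blocks
```

The 4x4 input shrinks to 2x2, i.e. by a factor of 4, and swapping the `reduce` function gives the other pooling variants.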

Understand the math behind the decision layers

Our Convolutional neural network really consists of two parts: Convolutional layers and fully connected decision layers. Both are connected using a flattening layer that converts an array of 2D images to a single 1D list of numeric values.

The Neural Network then applies its dark magic and comes to a decision whether the small filtered images are of class A or B. Well, it’s actually not that magical but more mathematical. While we do not need a deep understanding of how decision making in a fully connected neural network works, it is nonetheless quite interesting and can help us develop a better understanding of what happens under the hood when we call Keras to ‘train please, do your thing’.

Remember that a Neural Network is a typical case of supervised learning: We give the neural network a set of inputs and outputs, according to which it learns an optimal internal state. The network shall then be able to use this internal configuration to correctly predict new outputs on never seen inputs.

To explain how neural networks work, I chose to focus on a simpler problem than the one we have in place here, for two reasons. First, after several convolutional layers, our data is quite abstract and it is hard to visualize what our fully connected layers are doing. It is much simpler to showcase neural networks using a simpler, more mathematical classification problem that we can actually visualize. Second, there exists a GitHub repository with wonderful animations that demonstrate learning and classification in deep neural networks, which I will use here to underline my explanation.

The classification problem we focus on here is the following: we have a dataset consisting of two dimensions, v1 and v2. Each datapoint belongs to one of two classes: red or blue. We therefore have a two-dimensional, binary classification problem. We can easily visualize our dataset by plotting all our datapoints into a 2D-plane and coloring the points according to their label.

You can see that there exists a very clear distinction between our two classes. When we train a neural network, we want it to draw a clear line between the blue and red data. It can then make a prediction whether a new datapoint belongs to red or blue based on whether the datapoint lies inside or outside the decision boundary.

Even though this problem looks extremely simple for us humans, many traditional machine learning algorithms, linear classifiers in particular, fail to perform such a task, since the separation line is a circle. However, using a neural network with some hidden layers, we should be able to get to a good fit.
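If you want to play along, a dataset of this kind is easy to generate. The following is a hypothetical stand-in, not the dataset from the plots: the function name, the number of points, the radius and the convention "1 = blue = inside the circle" are all assumptions made for illustration.

```python
import numpy as np

def make_circle_data(n=1000, radius=0.6, seed=0):
    """Random points in [-1, 1]^2, labelled by whether they lie inside a circle."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))          # columns are v1 and v2
    inside = np.hypot(X[:, 0], X[:, 1]) < radius     # distance from the origin
    y = inside.astype(int)                           # assumed: 1 = blue, 0 = red
    return X, y

X_train, y_train = make_circle_data()
print(X_train.shape, y_train.mean())   # shape and share of 'blue' points
```

No straight line through the (v1, v2)-plane can separate these two classes, which is exactly what makes the problem a nice test case.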


As a first step, we model a neural network. The input layer of our neural network clearly has two neurons, since we deal with two dimensional data: one input neuron receives the ‘v1’-value, the other the ‘v2’-value. The output of our neural network could either be just one neuron where the neuron’s value encodes its class (1 for blue, -1 for red) or two neurons, where the combination of the two makes the class ([0 1] for blue, [1 0] for red).

We can see that our machine learning model has to create a high level function of quite some complexity, so we add several hidden layers into our network. The decision of how our neural network architecture should look is based on several factors:

  • The more difficult the problem is, the more hidden layers you should add
  • Try to keep the architecture simple, and the hidden layers minimal
  • When you want the network to combine features to new information, let it grow (e.g. add larger hidden layers)
  • If you want the network to reduce information and focus on the simple, let it shrink (e.g. add smaller hidden layers)
  • Look for papers that focus on problems similar to yours and copy their architecture
  • Try out different ones - there usually is no recipe for what will work for your specific problem, so feel free to try out new combinations and architectures
  • Use tools like auto-keras to algorithmically try different network architectures and identify a well-performing one

For our model, we use four hidden layers. This gives our network 6 layers in total, four of them being hidden layers. Note that a network is usually only called a “deep neural network” if it has several hidden layers - otherwise it would not be ‘deep’.

Now, to give you some motivation about what we are going to achieve, we will quickly model this neural network in keras and let it solve our binary classification problem. We use ‘relu’ activation functions for the hidden layers and a ‘sigmoid’ activation function for the output layer. We use an ‘adam’ optimizer to find the best configuration of our neural network and look at the ‘accuracy’ to determine its performance. Then, we let it train over 50 epochs before we stop and evaluate its results.

But don’t worry about that just yet, I have just brought it up so that you see how the keras code is written.

from keras.models import Sequential
from keras.layers import Dense

# Two inputs (v1, v2), four hidden layers, one sigmoid output neuron
model = Sequential()
model.add(Dense(4, input_dim=2, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Configure loss, optimizer and the metric we want to monitor
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train (fit) the model on our data for 50 epochs
model.fit(X_train, y_train, epochs=50, verbose=0)

Assuming X_train and y_train contain the data and labels of our training data, we are ready to train the neural network with just a few lines of keras code to nearly 100% accuracy.

Of course, this is not really shocking since our classification problem is simple. Nonetheless, it shows the power of keras’ high level interface to Tensorflow. We create a sequential model (meaning data will flow from the left to the right sequentially), add our dense layers while specifying the activation function, compile the model while specifying the optimizer and metrics, and are ready to train (fit) the model to our data. Tadaa.

The mentioned github repo contains a wonderful animation of how our neural network fits to the data over multiple epochs:

The region in red shows all (v1, v2)-value pairs for which our neural network predicted a ‘red’ label, while the blue region shows the areas where the neural network would predict a ‘blue’ label. The colors in between indicate that the network is not sure about the label. In the animation, you see the epochs progress. While the network is unable to make any meaningful guess in the first epochs, it fits itself almost perfectly around our data over time / epochs.

The question of the day now is: How does it do that?

Single Neurons

Let’s quickly recap the structure of neural networks: as we have seen, neural networks consist of neurons and synapses. Each neuron of one layer is connected via synapses to all neurons in the neighbouring layers to the left and to the right. This is why we speak of “fully connected layers”. Remember that we deal with a sequential model, meaning data only flows from the left to the right. This makes sense, since the input to the left is our data (v1, v2), and the output on the right is our classification value. Each neuron sends its information to all neurons of the layer to its right. Accordingly, each neuron receives a set of x-values (one from each neuron to its left) as an input and computes one single ^y value.

Each neuron has its own parameters, called weights and bias, or w and b for short. The weights and the bias are what characterize a neuron. When the neuron receives m input values x1 to xm from the m neurons in the layer to its left, it calculates the weighted sum of these input values. The weights are stored in the weight vector w, which has length m: for each input value x, the neuron holds a corresponding weight in w, with which it multiplies that input value. Then, the neuron adds its bias, a single number, to the weighted sum. The resulting value, z = w’ * x + b, is then passed to an activation function g - in our case relu. The output value of the single neuron is then ^y = g(z), or ^y = relu(w’ * x + b). Note that x and w have the same length, and w’ is the transposed weight vector, indicated by the transpose symbol ’, meaning the multiplication of w’ with x results in a single value, to which we add the bias b.

The activation function is just a simple, non-linear function that decides whether the neuron outputs a value (^y > 0) or no value at all (^y = 0).
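The computation of a single neuron fits into a few lines. This is a minimal sketch with made-up input values, weights and bias; only the formula ^y = relu(w’ * x + b) comes from the text above.

```python
import numpy as np

def relu(z):
    """Pass positive values through, clamp everything else to 0."""
    return np.maximum(0.0, z)

def neuron(x, w, b):
    """Output of one neuron: activation of the weighted sum plus bias."""
    z = np.dot(w, x) + b    # w' * x + b, a single number
    return relu(z)

x = np.array([0.5, -1.0, 2.0])    # values from the three neurons to the left (made up)
w = np.array([1.0, 0.5, 0.25])    # one weight per input (made up)

print(neuron(x, w, b=0.1))     # weighted sum is positive, so the neuron fires
print(neuron(x, w, b=-1.0))    # weighted sum is negative, so the output is 0
```

With these numbers the weighted sum w’ * x is 0.5, so the bias alone decides whether the neuron outputs a value or stays silent.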

Neural Layers

When we train a Neural Network, we are modifying the weights and bias in a smart way such that the network all together maps inputs to correct outputs. One single neuron is nothing more than a weighted combination of the previous layer neurons. In order to be smart, the Neural Network has to have multiple neurons per layer that form different combinations of information.

This is why we stack multiple neurons together to form a neural layer. Since we now know how simple it is to calculate the output of one single neuron, we combine this information and show how the output values of one full layer are calculated. We use vectorization across a whole layer and combine the calculations done on a neuron level into matrix calculations on a layer level.

We denote the l-th layer in our deep neural network by n[l], and the i-th neuron in that layer by n[l][i]. So the second neuron in the first layer would be n[0][1]. Note that we start with zero indices.

Note that one single neuron mapped an input x to its personal activation value ^y. Now that we deal with a whole layer, we need to combine the input vector x not to just one activation value, but to m activation values, with m being the number of neurons our layer contains. We call this the activation vector a of length m that represents the activation of all m neurons in our layer. Let us quickly rewrite the single neuron calculation based on this new notation:

We can write the same for all m neurons in layer 2:
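In symbols, both steps might be written as follows. This is a sketch under assumed notation rather than a definitive formulation: a^{[l-1]} stands for the activation vector coming from the previous layer, w^{[l]}_i and b^{[l]}_i for the weight vector and bias of neuron n[l][i], and g for the activation function, all chosen to match z = w’ * x + b from above.

```latex
% a single neuron i in layer l
z^{[l]}_i = {w^{[l]}_i}^{\top} a^{[l-1]} + b^{[l]}_i,
\qquad
a^{[l]}_i = g\left(z^{[l]}_i\right)

% the same for all m neurons of layer 2 (zero-based indices)
\begin{aligned}
z^{[2]}_0     &= {w^{[2]}_0}^{\top} a^{[1]} + b^{[2]}_0,         & a^{[2]}_0     &= g\left(z^{[2]}_0\right) \\
z^{[2]}_1     &= {w^{[2]}_1}^{\top} a^{[1]} + b^{[2]}_1,         & a^{[2]}_1     &= g\left(z^{[2]}_1\right) \\
              &\;\;\vdots                                        &               &\;\;\vdots \\
z^{[2]}_{m-1} &= {w^{[2]}_{m-1}}^{\top} a^{[1]} + b^{[2]}_{m-1}, & a^{[2]}_{m-1} &= g\left(z^{[2]}_{m-1}\right)
\end{aligned}
```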


Clearly, every neuron performs nearly the same operation. The only difference is that each neuron has its own weight vector and bias. We could of course use a for-loop to calculate the output vector z, or we take advantage of the fact that the GPU in our computer is insanely fast at calculating matrix multiplications and combine all our maths into one single matrix equation. This process is also called vectorization.

First of all, we put together the individual weight vectors to form a weight matrix W (note the capital W). Similarly, we stack all bias values together to form a bias vector b. This allows us to use one single matrix multiplication to get an output vector z containing the values for our whole layer, to which we then apply the activation function.

This vectorization has already done us a big favor in speeding up the calculation process, since GPUs are highly effective in calculating matrix multiplications in parallel. But let’s further speed up the training. We usually have a huge dataset on which we want to train our neural network, often reaching tens of thousands of datapoints. Let’s say we bundle our data together into batches of size m (not to be confused with the number of neurons from before). Now, instead of feeding the network a data vector a, we put together a matrix consisting of m vectors a and form a matrix A. We can then rewrite our previous equation, taking into account that we now feed not just a vector a but a matrix A:

The number of datapoints per batch is generally referred to as the “batch_size”, a parameter you can choose when letting keras train a neural network. Note that if you choose the batch-size to be too large, the resulting matrices become too massive for the GPU to process, making the training process fail. It is good to try maximizing the batch-size, but you can quickly overshoot.
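The following sketch compares the for-loop over neurons with the single matrix equation. The layer sizes and the random weights are made up; the point is only that both ways produce the same vector z.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, m = 3, 4                    # 3 neurons to the left, m = 4 neurons in this layer

W = rng.normal(size=(m, n_inputs))    # row i is the weight vector of neuron i
b = rng.normal(size=m)                # one bias per neuron
a_prev = rng.random(n_inputs)         # activations of the previous layer

# Neuron by neuron, using a for-loop
z_loop = np.array([np.dot(W[i], a_prev) + b[i] for i in range(m)])

# The whole layer at once, as one matrix-vector product
z_vec = W @ a_prev + b
a = np.maximum(0.0, z_vec)            # relu activation for the whole layer

print(np.allclose(z_loop, z_vec))     # True
```

On a layer this small the speed difference is irrelevant, but with thousands of neurons the matrix form is what lets the hardware work in parallel.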
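Feeding a whole batch through a layer then looks almost identical to the single-vector case. In this sketch the layer sizes, the batch size of 5 and the column-per-datapoint layout of A are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons, batch_size = 3, 4, 5

W = rng.normal(size=(n_neurons, n_inputs))     # weight matrix of the layer
b = rng.normal(size=(n_neurons, 1))            # column vector, broadcast over the batch
A_prev = rng.random((n_inputs, batch_size))    # one column per datapoint in the batch

Z = W @ A_prev + b          # all neurons and all datapoints in one multiplication
A = np.maximum(0.0, Z)      # relu, applied element-wise

print(A.shape)              # (4, 5): one activation per neuron and datapoint
```

Each column of `A` is exactly what the layer would have produced for that datapoint alone; the matrices simply grow with the batch size, which is why memory eventually becomes the limit.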

Activation function

Now that we have figured out the general math of how the data flows through our neural network, let’s take some time and focus on the single part that was not yet taken into account: the role of the activation function. Well, we need the activation function to introduce non-linearity: without an activation function, our neural network would just be a linear combination of input values, resulting in - you guessed it - a linear function itself. This has one significant downside: we cannot fit non-linear functions. We therefore introduce activation functions that we apply to the linear combination to gain flexibility towards more complex functions.

The most common activation functions are ReLU, tanh and sigmoid. ReLU loosely models the behaviour of neurons in our brain: most of the time, they are not activated, and they need quite some input to become active themselves. But when a brain neuron is active, it fires its whole load into all neurons connected to it via synapses. ReLU behaves a bit differently, since it does not fire as hard as it can when activated, but proportionally to its input strength. This gives subsequent neurons more information about how strong a stimulus the neuron received.

Sigmoid is another important activation function that is mostly used for output layers: it is quite binary, meaning it mostly outputs values close to either 0 or 1, but it can also output in-between values when the input stimulus is not clear enough - making sigmoid almost perfect for binary classification where we also want to get a feeling for ‘how certain’ our model is.
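That stacked layers without activation functions collapse into a single linear function can be checked numerically. The layer sizes and random weights below are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)   # first layer: 2 -> 4
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # second layer: 4 -> 1
x = rng.random(2)

# Two layers without any activation function ...
two_layers = W2 @ (W1 @ x + b1) + b2

# ... are equivalent to one layer with combined weights and bias
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True
```

No matter how many such layers we stack, the result stays a single linear map, so the extra layers add no expressive power until a non-linearity is placed between them.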
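For reference, here is a small sketch of the three activation functions, evaluated on a few made-up input values so you can compare their behaviour side by side.

```python
import numpy as np

def relu(z):
    """0 for negative inputs, proportional to the input otherwise."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])   # weak, neutral and strong stimulus

print(relu(z))      # silent, silent, proportional to the input
print(sigmoid(z))   # close to 0, exactly 0.5, close to 1
print(np.tanh(z))   # close to -1, exactly 0, close to 1
```

The in-between value 0.5 for a neutral input is what makes sigmoid readable as ‘how certain’ the model is.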

Loss & accuracy

Now, we have not actually talked about training the neural network yet. For now, our neural network is just a fancy way of randomly mapping an input value to some output value. Why random? Well, we mostly start with random weight vectors for each neuron, so we really just add up the inputs with random weights. Which - how else could it be - produces some random output value. Not that intelligent yet.

Well, as I said, our goal is to somehow choose weights and biases that make the neural network map the right input data to the right output data. To do so, we have to somehow optimize the weights and biases. And to do THAT, we need a function that determines how far away our network’s prediction is from the optimal solution. This is exactly what our loss function indicates. It symbolizes how far our network is from the ideal solution. There are different methods of how to determine “the distance between the network’s solution and the optimal solution”. The most intuitive loss function is simply loss = (desired output - actual output). However, this loss function returns positive values when the network undershoots (prediction < desired output), and negative values when the network overshoots (prediction > desired output). If we want the loss function to reflect an absolute error regardless of whether the network is overshooting or undershooting, we can define it as: loss = absolute value of (desired - actual).

However, several situations can lead to the same total sum of errors: for instance, a lot of small errors or a few big errors can sum up to exactly the same total amount of error. Since we would like the prediction to work in any situation, it is preferable to have a distribution of lots of small errors rather than a few big ones. In order to encourage the network to converge to such a situation, we can define the loss function to be the sum of squares of the errors. This way, small errors are counted much less than large errors.

Besides the loss function, we also measure the network’s performance by the accuracy function. While the loss shows how far from the optimal solution we probably are, the accuracy is a simple function that lets our network predict the labels of our data and checks what percentage it got right. Usually, after one iteration, our model does quite poorly, guessing just a little bit better than 50/50, resulting in an accuracy of around 0.5. Over time, we expect accuracy to grow and loss to shrink.

But wait a minute: shouldn’t loss and accuracy behave identically but mirrored, so that loss falls at the rate accuracy grows? Well, kind of, but not really. The loss is calculated on training and validation / testing data. While the network has obviously seen the data it was trained on, it has never seen validation data before - at least not in the current epoch. Loss can be seen as the summation of the errors a model has made on this data. It is therefore not a percentage like accuracy. When a network trains, it does not try to maximize accuracy, but to minimize loss. Loss indicates how well a network does after an iteration / epoch of optimization, while accuracy is calculated afterwards, by letting the model predict labels and counting how many of them are correct.

To make that clear: showing a single data pair to the neural network during training will always lead to a little bit of loss, since the weights and biases of the network are not perfectly fitted to the input data (and they should not be! We will later see why). But nonetheless, the weights and biases might be good enough to correctly predict the image, implying that there was no decrease in accuracy. Therefore, loss and accuracy need not behave the same. While we might reach 100% accuracy, we will in practice never reach a loss of exactly zero.
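A tiny numeric example shows the difference between the two loss definitions. The desired and predicted values are made up so that both error patterns have the same total absolute error.

```python
def absolute_loss(desired, actual):
    """Sum of absolute errors."""
    return sum(abs(d - a) for d, a in zip(desired, actual))

def squared_loss(desired, actual):
    """Sum of squared errors."""
    return sum((d - a) ** 2 for d, a in zip(desired, actual))

desired    = [0.0, 0.0, 0.0, 0.0]
many_small = [0.5, 0.5, 0.5, 0.5]   # four small errors
one_big    = [2.0, 0.0, 0.0, 0.0]   # a single large error

print(absolute_loss(desired, many_small), absolute_loss(desired, one_big))  # 2.0 2.0
print(squared_loss(desired, many_small), squared_loss(desired, one_big))    # 1.0 4.0
```

The absolute loss cannot tell the two situations apart, while the squared loss clearly prefers the many small errors.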
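The following sketch illustrates this with two sets of made-up predictions for the same four labels. It uses binary cross-entropy as the loss, matching the keras example above; the 0.5 decision threshold and the helper names are assumptions for illustration.

```python
import math

def binary_crossentropy(y_true, y_prob):
    """Mean binary cross-entropy between labels (0/1) and predicted probabilities."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob))
    return -total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of predictions that land on the correct side of the threshold."""
    hits = sum((p > threshold) == bool(t) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true    = [1, 0, 1, 0]
unsure    = [0.6, 0.4, 0.7, 0.3]       # correct, but not very certain
confident = [0.99, 0.01, 0.98, 0.02]   # correct and very certain

print(accuracy(y_true, unsure), binary_crossentropy(y_true, unsure))
print(accuracy(y_true, confident), binary_crossentropy(y_true, confident))
```

Both sets of predictions reach an accuracy of 1.0, yet their losses differ and neither is zero: the loss still rewards moving from ‘unsure’ to ‘confident’, even though the accuracy cannot improve any further.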

Minimizing Loss - Gradient Descent

Okay, we now know that our Neural Network want to minimize the error between the predicted value and the actual value - the loss. It can do so by modifying the weights and biases in all neurons. But how?

To achieve optimal weights and biases, we use a mathematical algorithm called ‘gradient descent’ to find the minimum of a multi-dimensional function. The dimensions are - in the case of a neural network - all the parameters we can tweak. To optimize one neural layer, we can tweak the layers weights and biases. As you might expect, there are quiet a lot of weights and biases for one single layer, and we therefore have a highly multi-dimensional optimization problem. The loss function now takes all these paramters as an input and outputs the loss - the value we want to minimize. Lucky for us, gradient descent is a highly optimized algorithm that finds the optimal paramters in order to minimize a function - in our case the loss function.

We start gradient descent by making a guess where the optimum (the minimum, in our case) might be. Then, we iteratively calculate the partial derivatives of our loss function at this position and check whether we have reached an optimum yet. Remember that the partial derivatives give us the slope of the loss function along each dimension, and we are only at a minimum if the slope is zero in every dimension, i.e. we cannot go down any further in any direction.

After calculating the derivatives, we not only find out whether we are at a minimum yet, but also in which direction we should go to minimize further, namely in the direction of maximum negative slope. Non-mathematically speaking: down the steepest side of the hill. We take a step in this direction of steepest descent and start again: check whether we are at a minimum yet, and if not, seek the route of the fastest falling slope and take a step in that direction. There are different configurations of the gradient descent algorithm, describing how to choose a good step size and when to stop looking for the minimum (since with this method we typically only get closer and closer to it without ever reaching it exactly).
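
The loop described above can be sketched in a few lines. This is only an illustration on a made-up, bowl-shaped loss with two parameters; the function, the starting point, the step size and the stopping tolerance are all assumptions chosen for the example.

```python
def loss(w1, w2):
    """A hypothetical bowl-shaped loss with its minimum at (3, -1)."""
    return (w1 - 3) ** 2 + 2 * (w2 + 1) ** 2

def gradient(w1, w2):
    """Partial derivatives of the loss with respect to w1 and w2."""
    return 2 * (w1 - 3), 4 * (w2 + 1)

def gradient_descent(w1, w2, step_size=0.1, tolerance=1e-6, max_steps=1000):
    for step in range(max_steps):
        g1, g2 = gradient(w1, w2)
        # Stop once the slope is (almost) zero in every dimension.
        if abs(g1) < tolerance and abs(g2) < tolerance:
            break
        # Step against the gradient, i.e. in the direction of steepest descent.
        w1 -= step_size * g1
        w2 -= step_size * g2
    return w1, w2, step

w1, w2, steps = gradient_descent(w1=0.0, w2=0.0)
print(f"minimum near ({w1:.4f}, {w2:.4f}) after {steps} steps")
```

Note that the loop stops when the slope is smaller than a tolerance, not when it is exactly zero, which mirrors the remark that the minimum is only approached.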

Let’s say that, after some iterations, gradient descent actually found a minimum. In order to visualize these steps, I again stole a little animation from the mentioned GitHub repository that shows a loss function that depends on two variables only and visualizes gradient descent over some iterations.

Note that we have absolutely no guarantee that gradient descent converges to (reaches) a global minimum, i.e. the overall lowest point. It is far more likely that we end up in a local valley while at some other point in the landscape there is an even deeper one. We can reduce the risk of ending up in such a local minimum by making a good initial guess, after roughly surveying the landscape and estimating where the global minimum is going to be.
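
How much the initial guess matters can be seen on a one-dimensional toy function with two valleys. The function below is invented for this illustration, as are the step size and the number of steps.

```python
def loss(x):
    """Toy loss with a deep valley near x = -1.30 and a shallow one near x = 1.13."""
    return x ** 4 - 3 * x ** 2 + x

def slope(x):
    """Derivative of the toy loss."""
    return 4 * x ** 3 - 6 * x + 1

def descend(x, step_size=0.01, steps=2000):
    """Plain gradient descent with a fixed number of small steps."""
    for _ in range(steps):
        x -= step_size * slope(x)
    return x

left = descend(-2.0)   # initial guess on the left side of the landscape
right = descend(2.0)   # initial guess on the right side of the landscape

print(f"start at -2.0: x = {left:.3f}, loss = {loss(left):.3f}")
print(f"start at  2.0: x = {right:.3f}, loss = {loss(right):.3f}")
```

Both runs end at a point where the slope is zero, but only the run starting on the left finds the deeper valley. The run starting on the right is stuck in a local minimum.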


Good, we can now find the optimal weights and biases for a single layer if we know the loss function. Now we use a technique called forward-backward propagation to check and optimize the performance of the whole network.

Forward propagation is just a fancy term for what we have already done: Feed the network the input data and see what output data we get. Easy. We then calculate the loss - how far away the network’s prediction is from the actual value. The problem is that, in order to optimize the overall network, we would need to combine all layers into one massive pile of parameters that form the loss. While this - in theory - can be done, it is not what we want. It would not just increase the complexity of the loss function but would also require that, for a big neural network with many layers, all parameters are optimized at once, limiting the size of the neural network that we can possibly optimize. If we could optimize layer by layer, we could optimize any neural network, as long as we can fit the loss function for one layer onto our GPU (or into RAM, if we work with a CPU cluster).

The only thing that prevents us from doing so is the fact that, if we only optimize the last layer towards the prediction output, all the previous layers are unaffected. We want the whole network to be optimized evenly.

Let’s call the aggregated, combined functions of the whole neural network our composition. This composition is what we want to optimize, and we want to optimize it on a per-layer basis. Luckily, derivatives are decomposable (this is the chain rule), meaning they can be back-propagated. We have the starting point of the errors, which is the loss function, and we know how to differentiate it. If we also know how to differentiate each function in the composition, we can propagate the error back from the end to the start.

If we create a library of differentiable functions, or layers, where for each function we know how to forward-propagate (by directly applying the function) and how to back-propagate (by knowing the derivative of the function), we can compose any complex neural network. We only need to keep a stack of the function calls during the forward pass and their parameters, in order to know the way back to backpropagate the errors using the derivatives of these functions. This can be done by de-stacking through the function calls. This technique is called auto-differentiation, and requires only that each function is provided with the implementation of its derivative.

Any layer can forward its results to many other layers. In this case, in order to do backpropagation, we sum the deltas coming from all the target layers. Thus our calculation stack can become a complex calculation graph.
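
For a simple chain of layers, the idea of a library of differentiable functions plus a stack of calls might look like the sketch below. The class and function names (`Multiply`, `Square`, `forward_pass`, `backward_pass`) are hypothetical and only serve the illustration; real frameworks are far more general and also handle the graph case.

```python
class Multiply:
    """Layer computing y = w * x for a fixed parameter w."""
    def __init__(self, w):
        self.w = w

    def forward(self, x):
        return self.w * x

    def backward(self, x, grad_out):
        # dy/dx = w, so the incoming error is scaled by w.
        return grad_out * self.w

class Square:
    """Layer computing y = x^2."""
    def forward(self, x):
        return x * x

    def backward(self, x, grad_out):
        # dy/dx = 2x, evaluated at the input seen during the forward pass.
        return grad_out * 2 * x

def forward_pass(layers, x):
    stack = []
    for layer in layers:
        stack.append((layer, x))  # remember each layer and the input it received
        x = layer.forward(x)
    return x, stack

def backward_pass(stack, grad):
    # De-stack the calls in reverse order, multiplying the derivatives together.
    while stack:
        layer, x = stack.pop()
        grad = layer.backward(x, grad)
    return grad

layers = [Multiply(3.0), Square()]    # composition: y = (3x)^2
y, stack = forward_pass(layers, 2.0)
dy_dx = backward_pass(stack, 1.0)     # start with d(output)/d(output) = 1
print(y, dy_dx)
```

For y = (3x)^2 the derivative is 18x, which at x = 2 gives 36, the same value the backward pass produces.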

Optimizers and Epochs

As we presented earlier, the derivative is just the rate at which the error changes relative to a change in the weights. For real-life problems we shouldn’t update the weights in big steps. Since there are a lot of non-linearities, any big change in the weights can lead to chaotic behaviour. We should not forget that the derivative is only valid locally, at the point where we calculate it.

Thus, the general rule for weight updates is the delta rule: New weight = old weight - Derivative Rate * learning rate. The learning rate is introduced as a constant (usually very small), in order to force the weight to get updated very smoothly and slowly (to avoid big steps and chaotic behaviour).

If the derivative rate is positive, it means that an increase in weight will increase the error, thus the new weight should be smaller.

If the derivative rate is negative, it means that an increase in weight will decrease the error, thus we need to increase the weights.

If the derivative is 0, it means that we are in a stable minimum. Thus, no update on the weights is needed -> we reached a stable state.
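
The three cases can be checked with a few lines of code. The helper name `update_weight`, the starting weight and the learning rate are arbitrary choices for this illustration.

```python
def update_weight(weight, derivative, learning_rate=0.01):
    """Delta rule: new weight = old weight - derivative * learning rate."""
    return weight - derivative * learning_rate

print(update_weight(0.5, derivative=2.0))   # positive derivative: the weight shrinks
print(update_weight(0.5, derivative=-2.0))  # negative derivative: the weight grows
print(update_weight(0.5, derivative=0.0))   # zero derivative: the weight is unchanged
```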

Several weight update methods exist; they are called optimizers. Remember that we chose ‘adam’ as our optimizer when configuring Keras.

Since we update the weights by only a small delta step at a time, it takes many iterations to learn. One full pass over the training data is called an epoch, and training usually runs over several of them. In a neural network, after each iteration, gradient descent updates the weights so that the global loss becomes smaller and smaller.

How many epochs are needed to converge depends on the learning rate, the network’s architecture and the optimizer used. Some optimizers are fast but more likely to end in chaos, others are slow and stable but might stand still in a position that is far from optimal.

It also depends on the random weight initialization of the network and the quality of the training data.


At different stages in my text, I mentioned the problem of overfitting without having stated why it occurs or how we can detect it. Since it is one of the main problems complex neural networks face, I want to take some time to address overfitting in greater detail.

Basically, it describes a phenomenon where our network learns a mapping function for the data that is too specific. Remember, when we feed the neural network training data, we adjust its weights and biases according to the loss between the prediction and the actual output. If we train the network for too long or too extensively, we might reach the point where the neurons in the neural network fit the input data perfectly.

But why is that a problem? Well, we want to use our neural network to predict outputs for new inputs that are similar to the training data - not the same. When we train too extensively, our network will fit the data so well that even slight variations in never-seen input data can lead to a false prediction. When the network has learned so hard that class A only occurs when one single pixel in the top-left corner has a certain greyscale value, a new input image might be classified wrongly because that one pixel is missing. Clearly, we do not want the network to look that closely at details in the data, but rather to generalize which set of features mostly makes up a certain class / label.

The problem of overfitting can be easily explained using an analogy from function approximation. In maths, we often deal with datapoints that we want to approximate using a function. Let’s say we have simple, two-dimensional x-y datapoints and we look for a function that describes the way this data behaves, allowing us to gather insights about the data’s behaviour by analysing the approximation function. With such an easy sample, we can perfectly illustrate the effect of overfitting. Let’s look at the following data:

You see the datapoints in blue. The orange and yellow lines are two different types of polynomial interpolations: the yellow line is a quite low order polynomial, a relatively simple function. It might be a polynomial of order 5 or 7, for example. In contrast, the orange line is a very high order polynomial, maybe of order 27.

When we calculate the least squared error between the interpolation function and the data, we quickly see that the orange function fits our data better - it is generally closer to the blue datapoints than the yellow line. But it does in fact not bring us more value - it fits the data too closely, showing local maxima and minima all over the place where we can find none in the data itself. So although we might think that a higher order polynomial that fits the data more closely is generally better, it is actually not, since we cannot use it to detect trends or perform calculations on the function to gather insights about the behavior of the underlying data.
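
The same effect can be reproduced with a small sketch. It does not recreate the figure: the eight datapoints are made up (a linear trend plus some noise), the simple model is a least-squares straight line, and the complex model is the order-7 polynomial that passes through every point. All helper names are assumptions for this example.

```python
def fit_line(xs, ys):
    """Least-squares straight line (a very low order polynomial)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_interpolant(xs, ys):
    """Polynomial of order len(xs) - 1 through every point (Lagrange form)."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly

def squared_error(f, xs, ys):
    """Sum of squared differences between the function and the data."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys))

# Made-up data: the trend y = x with some noise added.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.3, 0.8, 2.4, 2.7, 4.3, 4.6, 6.4, 6.8]

line = fit_line(xs, ys)
wiggly = fit_interpolant(xs, ys)

print("training error, line:   ", squared_error(line, xs, ys))
print("training error, order 7:", squared_error(wiggly, xs, ys))
# Between the datapoints the high order polynomial is free to swing away from the trend.
print("predictions at x = 6.5: ", line(6.5), wiggly(6.5))
```

The high order polynomial reaches a training error of zero because it reproduces the noise exactly, which is precisely why its error on the training data says little about how well it describes the underlying trend.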

To bring the same analogy to neural networks, let’s consider a dataset similar to the one I introduced in the beginning of the chapter: red and blue dots on a Cartesian coordinate system. But this time, the data slightly overlaps, making it nearly impossible for the neural network to draw a clear line between the regions that should result in blue labels and those that should result in red ones.

If we now train our neural network over many epochs on this data, we see that it tries to fit the training data too specifically, forming really strange shapes that do not describe our overall data at all. Just because some blue points happen to form a small group among the reds does not mean that the network should fit around it.

In general, we can identify overfitting by looking at the cross-validation set, i.e. by testing the network’s performance on data it has never seen before. But this can become difficult in some situations.

So our best bet is to use methods that prevent overfitting in general, and there are different methods to do so.

L1 and L2 Regularizations

The first thing we do to prevent overfitting is regularization. A regularizer adds an extra term to the loss function that punishes the model for being too complex, i.e. for having very large values in the network’s weight matrix. This does limit the flexibility of our network, but it also ensures that no single neuron becomes too important for the classification.

Two popular versions of this method are L1 - Least Absolute Deviations (LAD), which penalizes the absolute values of the weights, and L2 - Least Square Errors (LS), which penalizes their squares. L1 is often preferable because it reduces the weight values of less important features to zero, very often eliminating them completely from the calculations. In a way, it is a built-in mechanism for automatic feature selection. Moreover, L2 does not perform very well on datasets with a large number of outliers: because the values are squared, the model concentrates on minimizing the impact of outliers at the expense of more common examples.

Regularizers take a regularization factor lambda that changes the impact of the values contained in the weight matrix. We can also call it the regularization rate. Choosing the lambda factor is hard, since a lambda that is too large simplifies our model to an extent that makes it unable to fit our data, while a value that is too small simply does not regularize enough and therefore does not prevent overfitting.
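
As a rough sketch of how the penalty enters the loss: the weights, the unregularized loss value of 0.25 and the function names below are invented for illustration, and a real framework applies the penalty per layer.

```python
def l1_penalty(weights, lam):
    """L1: lambda times the sum of the absolute weight values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2: lambda times the sum of the squared weight values."""
    return lam * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam, penalty):
    """The loss the optimizer actually minimizes: data loss plus penalty."""
    return data_loss + penalty(weights, lam)

weights = [0.5, -3.0, 0.1]  # one large weight dominates
data_loss = 0.25            # hypothetical loss without regularization

for lam in (0.0, 0.01, 0.1, 1.0):
    with_l1 = regularized_loss(data_loss, weights, lam, l1_penalty)
    with_l2 = regularized_loss(data_loss, weights, lam, l2_penalty)
    print(f"lambda = {lam}: L1 loss = {with_l1:.3f}, L2 loss = {with_l2:.3f}")
```

With lambda set to 0 the penalty vanishes and the loss is unchanged. As lambda grows, the large weight of -3.0 contributes more and more to the total, and under L2 it does so quadratically.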

Let’s have a look at the impact of different lambdas on our loss function, using L1 and L2 regularization with different lambda values.

We can immediately notice that the planes obtained for the model without regularization, and for models with a very low λ coefficient value, are very “turbulent”. There are many peaks with significant values. After applying L2 regularization with a higher hyperparameter value, the graph flattens. Finally, we can see that setting the lambda value to around 0.1 or 1 causes a drastic decrease in the value of the weights in our model.


A surprisingly effective method to prevent overfitting in fully connected decision layers is dropping out neurons at random: By disabling certain neurons randomly, meaning they do not output their action potential during that training step, we force the network to rely on different neurons each time. We force the network to train itself evenly across all neurons, therefore generalizing the data that leads to a prediction. In our case, we have set the dropout rate to 25%. Having a higher dropout rate can be beneficial since it further reduces the risk of overfitting, but it may also be dangerous: A neuron does not see the input image while it is dropped out, so a dropout rate that is too high prevents our neurons from being trained effectively, and certain network structures cannot form because too few neurons are present to reach a decision.

Mathematically speaking, since any input value can be randomly eliminated in each iteration, the neuron tries to balance the risk and not to favour any of the features. As a result, the values in the weight matrix become more evenly distributed. The model wants to avoid a situation in which the solution it proposes no longer makes sense because information from an inactive feature is missing.
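
A dropout step on a list of activations could be sketched like this. The rescaling of the surviving values ("inverted dropout") is a common convention that is assumed here rather than taken from the article, and the function name and the seed are arbitrary.

```python
import random

def dropout(activations, rate, rng):
    """Randomly zero out roughly a fraction `rate` of the activations (training only)."""
    keep_prob = 1.0 - rate
    # Surviving values are scaled up so that the expected sum stays the same
    # (inverted dropout, an assumption of this sketch).
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)           # fixed seed so the run is repeatable
activations = [1.0] * 1000       # pretend output of a fully connected layer
dropped = dropout(activations, rate=0.25, rng=rng)

share_dropped = dropped.count(0.0) / len(dropped)
print(f"share of dropped neurons: {share_dropped:.2f}")
```

Because a fresh random mask is drawn on every call, a different set of neurons is silenced in each training step, which is what forces the network to spread its knowledge.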

Early stopping 

The obvious. If we train for too long, we fit the data too well. When we closely monitor the model’s accuracy on the train and test set, we can see that after too many iterations, the test accuracy might start to decline while the train accuracy keeps rising. This is a direct indication that - despite all countermeasures - we are starting to overfit. It’s time to stop training if we see this kind of behaviour.
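
The stopping decision can be sketched as a small function. The accuracy history is made up, and the `patience` parameter (how many epochs without improvement we tolerate) is an assumption of this sketch, not something configured earlier in the article.

```python
def early_stopping_epoch(val_accuracies, patience=2):
    """Return the index of the epoch at which training should stop."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            # New best result on unseen data: remember it and keep training.
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: we are probably overfitting.
            return epoch
    return len(val_accuracies) - 1

# Hypothetical test accuracy per epoch: it rises, peaks, then declines.
history = [0.60, 0.72, 0.80, 0.83, 0.82, 0.81, 0.79]
stop = early_stopping_epoch(history)
print(f"stop after epoch {stop}, best epoch was {history.index(max(history))}")
```

Keras offers a comparable `EarlyStopping` callback, so in practice this logic rarely has to be written by hand.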

Now you have a fairly good understanding of many aspects of Convolutional Neural Networks. If you have questions, feel free to post them in the comment section, I am happy to help!