What Are Convolution Neural Networks? [ELI5] by @NKumar

October 26th 2019

The Universal Approximation Theorem says that a Feed-Forward Neural Network (also known as a **Multi-layered Network of Neurons**) can act as a powerful approximator of the non-linear relationship between input and output. The problem with the Feed-Forward Neural Network is that it is prone to over-fitting, because the network has so many parameters to learn.

Can we have another type of neural network that can learn complex non-linear relationships with fewer parameters, and hence be less prone to over-fitting? The **Convolution Neural Network** (CNN) is another type of neural network, one that enables machines to visualize things. Image classification, image recognition, object detection, and instance segmentation are some of the most common areas where CNNs are used.

In this article, we will explore the workings of the Convolution Neural Network in-depth. This article is broadly divided into two parts:

In part one, we will discuss how the convolution operation works across different inputs: 1D, 2D, and 3D.

In the second part, we will explore the background of Convolution Neural Networks and how they compare with Feed-Forward Neural Networks. After that, we will discuss the key concepts of CNNs.

Citation Note: The content and the structure of this article is based on the deep learning lectures from One-Fourth Labs — PadhAI.

Let’s start with the basics. In this section, we will understand what the convolution operation is and what it actually does.

Imagine an aircraft that takes off from Arignar Anna International Airport (Chennai, India) and heads towards Indira Gandhi International Airport (New Delhi, India). Using some instrument, you measure the speed of the aircraft at regular intervals. In a typical scenario the equipment might not be 100% accurate, so instead of relying only on the most recent reading we might want to take an average of the past values.

Since these readings are taken at different time steps, instead of taking a simple average we want to give more importance to the most recent reading than to the previous ones, i.e. assign more weight to the current reading and smaller weights to the older readings.

Let’s say the weight assigned to the reading at the current time step is w₀, the weight for the previous reading is w₋₁, and so on. The weights are assigned in decreasing order. From a mathematical standpoint, imagine that we have infinitely many readings for the aircraft, each with a weight assigned to its time step. Then the re-estimated speed at the current time step, Sₜ, is the weighted sum of all these readings.
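Written out, the weighted sum described above can be sketched as follows. The symbol x (the raw reading at a given time step) and the summation index a are notation assumed here for illustration; only Sₜ and the weights w₀, w₋₁, and so on come from the text.

```latex
S_t = \sum_{a=0}^{\infty} x_{t-a} \, w_{-a}
```

With a = 0 the term is the current reading multiplied by w₀, with a = 1 it is the previous reading multiplied by w₋₁, and so on back in time.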

Let’s take a simple example and see how we can calculate the value of the reading at the current time step using this weighted sum.

We would like to calculate the speed of the aircraft at the current time step (t₀). The actual speed we got from the instrument is 1.90, which we don’t fully trust, so we calculate a new value by taking the weighted average of the current and previous readings. The new reading obtained from this operation is 1.80.

In the convolution operation we are given a set of inputs and we calculate the value of the current input based on all its previous inputs and their weights.
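A minimal Python sketch of this 1D re-estimation is shown below. The function name `weighted_estimate`, the readings, and the weights are all hypothetical: they are chosen only so that a raw reading of 1.90 is re-estimated as 1.80, and they are not the values from the original figure.

```python
def weighted_estimate(readings, weights):
    """Re-estimate the latest reading as a weighted sum of recent readings.

    readings: oldest first, so readings[-1] is the current time step.
    weights:  most recent first, so weights[0] plays the role of w0,
              weights[1] of w-1, and so on.
    """
    # Pair the newest reading with w0, the one before it with w-1, ...
    recent_first = reversed(readings[-len(weights):])
    return sum(x * w for x, w in zip(recent_first, weights))


# Hypothetical speed readings; the last one (1.9) is the current reading.
readings = [1.5, 1.6, 1.7, 1.8, 1.9]
# Hypothetical decreasing weights that sum to 1, so the result is an average.
weights = [0.4, 0.3, 0.2, 0.1]

estimate = weighted_estimate(readings, weights)
print(round(estimate, 2))  # 1.8
```

Because a finite list of weights is used, this is a truncated version of the infinite sum: readings older than the last weight simply get a weight of zero.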

In this example, I haven’t talked about how we obtain these weights or whether they are right or wrong. For now, just focus on how the convolution operation works.

In the previous section, we saw how the convolution operation works on a 1D input. In a nutshell, we re-estimate the value at a particular input as a weighted average of the inputs around it. Let’s discuss how we can apply the same concept to 2D inputs.

For ease of explanation, consider a greyscale image with only one color channel. Imagine each pixel in the image as a reading taken by a camera. If we want to re-estimate the value of any pixel, we can take the values of its neighbors and compute their weighted average to get the re-estimated value. The formula for this operation is defined in terms of the following quantities:

K: the matrix that represents the weights assigned to pixel values. It has two indices, a and b, where a denotes rows and b denotes columns.

I: the matrix containing the input pixel values.

Sᵢⱼ: the re-estimated value of the pixel at location (i, j).
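One common way to write this weighted sum with the symbols just defined is sketched below. The kernel dimensions m and n (assumed odd) and the choice to index K from its center are assumptions made here for illustration. Conventions differ between texts, and some flip the kernel and write I with indices i − a and j − b instead.

```latex
S_{ij} = \sum_{a=-\lfloor m/2 \rfloor}^{\lfloor m/2 \rfloor} \;
         \sum_{b=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor}
         I_{\,i+a,\; j+b} \; K_{a,b}
```

For a 3x3 kernel, a and b each run over −1, 0, and 1, so the sum covers the pixel at (i, j) and its eight neighbors.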

Let’s take an example to understand how the formula works. Imagine that we have an image of the Taj Mahal and a 3x3 weight matrix (also known as a kernel). In the convolution operation, we superimpose the kernel on the image so that the pixel of interest is aligned with the center of the kernel, and then we compute the weighted average of all the neighborhood pixels. We then slide the kernel from left to right until it has covered the entire width, and then from top to bottom, to compute the weighted average for all the pixels in the image.

Consider that we are going to re-estimate the pixel values in an image using a 3x3 averaging kernel, in which all nine weights are equal. We do that by going over each pixel in the image systematically and placing the kernel so that the pixel is at the center of the kernel. We then re-estimate the value of that pixel as the weighted sum of itself and its neighbors.

In this operation, we are taking an average of 9 pixels: the 8 neighbors and the pixel itself. As a result, the output image is blurred, or smoothed out.

How would we perform the convolution operation in the case of a 3D input?
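The sliding weighted average can be sketched in plain Python as below. The function name `convolve2d` and the tiny 5x5 image are hypothetical, and the kernel is assumed to be one whose nine weights are all 1/9. The sketch only visits positions where the kernel fits entirely inside the image.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and return the weighted sums.

    Only positions where the kernel fits entirely inside the image are
    computed, so the output is smaller than the input.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for i in range(out_rows):          # top to bottom
        row = []
        for j in range(out_cols):      # left to right
            total = 0.0
            for a in range(k_rows):
                for b in range(k_cols):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# 3x3 averaging kernel: every weight is 1/9.
blur_kernel = [[1 / 9] * 3 for _ in range(3)]

# Hypothetical 5x5 greyscale image with one bright pixel in the middle.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

blurred = convolve2d(image, blur_kernel)
for row in blurred:
    print([round(value, 2) for value in row])
```

Every 3x3 window in this small image contains the bright pixel, so its value of 9 is spread evenly and each output value comes out as roughly 1.0, which is the smoothing effect described above.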

All the images we have seen so far are actually 3D, because they have 3 input channels (Red, Green, and Blue, or RGB), but we have ignored that for the sake of explanation. In this section, we will consider the image in its original form, i.e. as a 3D input. In a 3D image, every pixel has 3 values, one in each of the red, green, and blue channels.

For a 2D input, we slide the kernel (which is also 2D) in both the horizontal and vertical directions. For a 3D input we use a 3D kernel, meaning that the depth of the kernel is the same as the depth of the image. The kernel does not move along the depth, since both kernel and image have the same depth.

Similar to the 2D convolution operation, we slide the kernel horizontally and vertically. Every time we move the kernel, we take the weighted average of the entire 3D neighborhood, i.e. the weighted neighborhood of RGB values. Since we slide the kernel in only two dimensions (left to right and top to bottom), the output of this operation is 2D.

Even though both the input and the kernel are 3D, the convolution operation we are performing is 2D, because the depth of the filter is the same as the depth of the input.

In practice instead of applying one kernel, we can apply multiple kernels with different values on the same image one after another so that we can get multiple outputs.

All of these outputs can be stacked on top of each other to form a volume. If we apply three filters to the input, we get an output of depth 3. The depth of the output of the convolution operation is equal to the number of filters applied to the input.
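The sketch below illustrates both ideas: a 3D kernel that slides only along rows and columns, and several kernels whose 2D outputs are stacked into a volume. The function names, the channel-first layout, and the constant-valued image and kernels are hypothetical choices made for illustration.

```python
def convolve3d(volume, kernel):
    """Apply one 3D kernel to a 3D input and return a 2D feature map.

    volume[c][i][j] and kernel[c][a][b] are indexed channel first. The kernel
    has the same depth as the input, so it only slides along rows and columns.
    """
    depth = len(volume)
    assert depth == len(kernel), "kernel depth must match input depth"
    k_rows, k_cols = len(kernel[0]), len(kernel[0][0])
    out_rows = len(volume[0]) - k_rows + 1
    out_cols = len(volume[0][0]) - k_cols + 1
    feature_map = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            total = 0.0
            for c in range(depth):            # sum over all channels
                for a in range(k_rows):
                    for b in range(k_cols):
                        total += volume[c][i + a][j + b] * kernel[c][a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map


def apply_filters(volume, kernels):
    """Stack one 2D feature map per kernel into an output volume."""
    return [convolve3d(volume, kernel) for kernel in kernels]


# Hypothetical 3-channel (RGB) 4x4 image filled with ones.
rgb_image = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
# Three hypothetical 3x2x2 kernels with constant weights 1, 2 and 3.
kernels = [[[[float(w)] * 2 for _ in range(2)] for _ in range(3)]
           for w in (1, 2, 3)]

output = apply_filters(rgb_image, kernels)
print(len(output), len(output[0]), len(output[0][0]))  # 3 3 3
```

Three kernels give an output of depth 3, and each slice of that output is an ordinary 2D feature map.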

In the previous sections, we learned how the convolution operation works for different kinds of input. In this section, we will discuss how to compute the dimensions of the output after the convolution operation.

Consider a 2D input of size 7x7 to which we apply a 3x3 filter, starting from the top left corner of the image. As we slide the kernel over the image from left to right and top to bottom, the output turns out to be smaller than the input: it is 5x5.

Why is the output smaller?

Because we can’t center the kernel on the pixels along the boundary, such as the corners: it would cross the input boundary. The values of the pixels outside the image are undefined, so we don’t know how to compute the weighted average in that area.

We are therefore not computing the weighted average, and not re-estimating the pixel value, for every pixel in the input. This is true for all the shaded pixels on the border of the image (at least with a 3x3 kernel), hence the size of the output is reduced. This operation is known as **Valid Padding**.

What if we want the output to be the same size as the input?

The size of the original input was 7x7 and we also want the output size to be 7x7. In that case, we can add an artificial pad of zeros evenly around the input, so that we are able to place the kernel **K** (3x3) on the corner pixels and compute the weighted average of their neighbors.

By adding this artificial padding around the input, we are able to keep the shape of the output the same as that of the input. If we have a bigger kernel (**K** 5x5), the amount of padding we need to apply also increases in order to maintain the same output size. In this process, the size of the output is the same as the size of the input, hence the name **Same Padding (P)**.
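A small sketch of zero padding is given below. The helper names `zero_pad` and `same_padding` are hypothetical, and the rule (kernel size − 1) / 2 assumes an odd kernel size and a stride of 1.

```python
def zero_pad(image, pad):
    """Return a copy of `image` with `pad` rows/columns of zeros on every side."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def same_padding(kernel_size):
    """Padding that keeps the output size equal to the input size.

    Assumes an odd kernel size and a stride of 1.
    """
    return (kernel_size - 1) // 2


image = [[1] * 7 for _ in range(7)]          # hypothetical 7x7 input
padded = zero_pad(image, same_padding(3))    # a 3x3 kernel needs a pad of 1

print(len(padded), len(padded[0]))           # 9 9
print(same_padding(5))                       # 2
```

A 3x3 kernel fits at 7x7 positions inside the 9x9 padded input, so the output has the same size as the original image. A 5x5 kernel needs a pad of 2 for the same effect.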

So far, we have seen in the images that we slide the kernel (filter) from left to right at a certain interval until it has covered the width of the image, and then from top to bottom until the entire image is traversed. **Stride (S)** defines the interval at which the filter is applied. By choosing a stride greater than 1, we skip a few pixels when computing the weighted average of the neighbors. The higher the stride, the smaller the output image.

We can combine the things we learned in this section into a mathematical formula that gives the width and height of the output image in terms of the input size, the filter size, the padding (P), and the stride (S).
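The relation usually used for this is sketched below. The symbols W_I and H_I (input width and height), W_O and H_O (output width and height), and F (filter size, assuming a square filter) are notation assumed here; P and S are the padding and stride from the text. When the division is not exact, the result is typically rounded down.

```latex
W_O = \frac{W_I - F + 2P}{S} + 1
\qquad
H_O = \frac{H_I - F + 2P}{S} + 1
```

For the 7x7 input and 3x3 filter above, P = 0 and S = 1 give (7 − 3 + 0)/1 + 1 = 5, and P = 1 gives (7 − 3 + 2)/1 + 1 = 7, matching the valid and same padding examples.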

Finally, coming to the depth of the output: if we apply ‘**K**’ filters to our input, we get ‘**K**’ such 2D outputs. Hence the depth of the output is the same as the number of filters.
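Putting width, height, and depth together, a small helper might look like the sketch below. The function name and parameter names are hypothetical, square filters are assumed, and floor division is used for strides that do not divide evenly.

```python
def conv_output_shape(in_width, in_height, filter_size, num_filters,
                      padding=0, stride=1):
    """Return (width, height, depth) of the output of a convolution layer."""
    out_width = (in_width - filter_size + 2 * padding) // stride + 1
    out_height = (in_height - filter_size + 2 * padding) // stride + 1
    # One 2D output per filter, so the depth equals the number of filters.
    return out_width, out_height, num_filters


print(conv_output_shape(7, 7, 3, num_filters=1))             # (5, 5, 1)
print(conv_output_shape(7, 7, 3, num_filters=1, padding=1))  # (7, 7, 1)
print(conv_output_shape(7, 7, 3, num_filters=3, stride=2))   # (3, 3, 3)
```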

How did we arrive at Convolution Neural Networks?

Before we discuss Convolution Neural Networks, let's travel back in time and understand how image classification was done in the pre-deep-learning era. That also serves as motivation for why we prefer Convolution Neural Networks for Computer Vision.

Let’s take the task of image classification, where we need to classify a given image into one of several classes. The earliest method was to flatten the image, i.e. an image of 30x30x3 is flattened into a vector of 2700 values, and feed this vector into a machine learning classifier such as an SVM or Naive Bayes. The key takeaway is that we feed the raw pixels as input to the machine learning algorithm and learn the parameters of the classifier for image classification.
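Flattening itself is a simple reshaping step, as the sketch below shows. The nested-list layout (rows, then columns, then the three color values) and the all-zero image are assumptions made for illustration.

```python
def flatten(image):
    """Flatten a height x width x channels nested list into one flat vector."""
    return [value for row in image for pixel in row for value in pixel]


# Hypothetical 30x30 RGB image: image[row][col] is a [R, G, B] triple.
image = [[[0, 0, 0] for _ in range(30)] for _ in range(30)]

vector = flatten(image)
print(len(vector))  # 2700
```

The classifier then sees 2700 independent numbers, with no notion of which values were neighboring pixels in the original image.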

After that, people started to realize that not all the information present in the image is important for image classification. In the next method, instead of passing the raw pixels to the classifier, we pre-process the image by applying some pre-defined or handcrafted filters (e.g. an edge detector filter) and then pass the pre-processed representation to the classifier.
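As an illustration of a handcrafted filter, the sketch below applies a Prewitt-style vertical-edge kernel to a tiny image. The article does not say which edge detector was used, so this particular kernel, the function name `apply_filter`, and the image are assumptions chosen only to show fixed, hand-picked weights at work.

```python
def apply_filter(image, kernel):
    """Weighted sum of a square `kernel` at every position where it fits."""
    k = len(kernel)
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(len(image[0]) - k + 1)
        ]
        for i in range(len(image) - k + 1)
    ]


# Hand-picked weights: respond to dark-to-bright changes from left to right.
vertical_edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# Hypothetical image: dark left half (0), bright right half (1).
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

edges = apply_filter(image, vertical_edge_kernel)
print(edges)  # [[0, 3, 3, 0]]
```

The response is non-zero only where the window straddles the boundary between the dark and bright halves. Nothing in the kernel is learned from data, which is the limitation the following paragraphs build on.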

Once feature engineering (pre-processing the image) started to give better results, improved algorithms like SIFT and HOG were developed to generate a refined representation of the image. The feature representation generated by these methods is static, i.e. there is no learning involved in generating the representation; all the learning is pushed to the classifier.

Instead of manually generating the feature representation of an image, why not flatten the image into a vector of 2700x1 and pass it into a **Feed-Forward Neural Network**, or Multi-layered Network of Neurons (MLN), so that the network can learn the feature representation as well?

Unlike static methods such as SIFT/HOG or edge detectors, we are not fixing the weights; we allow the network to learn them through back-propagation so that the overall loss of the network reduces. Feed-Forward Neural Networks can learn a single feature representation of the image, but for complex images they fail to give good predictions because they can’t learn the pixel dependencies present in the images.

A Convolution Neural Network can learn multiple layers of feature representations of an image by applying different filters/transformations, so that it is able to preserve the spatial and temporal pixel dependencies present in the image. In CNNs, the number of parameters the network has to learn is significantly lower than in an MLN, thanks to sparse connectivity and weight sharing, which also allows CNNs to train faster.

In this section, we will understand the difference between the Feed-Forward Neural Network and the Convolution Neural Network with respect to Sparse Connectivity and Weight Sharing.

Consider that we are performing a task of digit recognition and our input has 16 pixels. In a Feed-Forward Neural Network, each neuron in a hidden layer is connected to all the outputs of the previous layer, i.e. it takes a weighted average of all the inputs connected to that neuron.

In Convolution Neural Network, by superimposing the kernel over the image we are considering only a few inputs at a time to compute the weighted average of selected pixel inputs. The output h₁₁ is calculated using much sparser connections rather than considering all the connections.

Remember that when computing the output h₁₁ we considered only 4 inputs, and similarly for the output h₁₂. One important point to note is that we use the same 2x2 kernel to calculate h₁₁ and h₁₂, i.e. the same weights are used to compute both outputs. This is unlike the Feed-Forward Neural Network, where each neuron in the hidden layer has its own separate weights. This practice of using the same weights across the input to compute the weighted average is called **Weight Sharing**.
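The sketch below puts numbers on both ideas for a 16-pixel input arranged as 4x4. The pixel values, the kernel weights, and the choice of comparing against a fully connected layer with 9 outputs (the number of positions a 2x2 kernel can take on a 4x4 input with stride 1) are assumptions made for illustration.

```python
# Hypothetical 4x4 input (16 pixels) and one shared 2x2 kernel.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [[1, 0],
          [0, 1]]


def output_at(image, kernel, i, j):
    """Weighted sum of only those pixels that lie under the kernel at (i, j)."""
    return sum(image[i + a][j + b] * kernel[a][b]
               for a in range(len(kernel)) for b in range(len(kernel[0])))


h11 = output_at(image, kernel, 0, 0)   # uses pixels 1, 2, 5, 6 only
h12 = output_at(image, kernel, 0, 1)   # uses pixels 2, 3, 6, 7, same weights

# Weights needed to produce 9 outputs from 16 inputs (biases ignored).
num_inputs, num_outputs = 16, 9
dense_weights = num_inputs * num_outputs       # every input to every output
conv_weights = len(kernel) * len(kernel[0])    # one kernel shared everywhere

print(h11, h12, dense_weights, conv_weights)   # 7 9 144 4
```

Sparse connectivity is visible in each output touching only 4 of the 16 pixels, and weight sharing in the same 4 kernel values being reused at every position.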

Consider an input volume with a length, a width, and a depth of 3 channels. When we apply a filter of the same depth to the input, we get a 2D output, also known as a feature map of the input. Once we have the feature map, we typically perform a **Pooling operation**. Since the number of hidden layers required to learn the complex relations present in the image would be huge, we apply pooling to reduce the size of the feature representation, thereby reducing the computational power required by the network.

Once we obtain the feature map of the input, we slide a filter of a chosen shape across the feature map and take the maximum value from each portion it covers. This is known as **Max Pooling**. It is also known as **Sub-Sampling**, because from the entire portion of the feature map covered by the kernel we sample one single maximum value.

Similar to Max Pooling, **Average Pooling** computes the average value of the feature map covered by the kernel.
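Both pooling variants can be sketched with one helper, as below. The function name `pool2d`, the 4x4 feature map, and the use of non-overlapping 2x2 windows (a stride equal to the window size) are assumptions; the article does not specify the window size or stride.

```python
def pool2d(feature_map, size, reduce_fn):
    """Apply `reduce_fn` to each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [4, 2, 1, 3],
    [8, 6, 5, 7],
]

max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, average)

print(max_pooled)  # [[7, 8], [8, 7]]
print(avg_pooled)  # [[4.0, 5.0], [5.0, 4.0]]
```

In both cases the 4x4 feature map shrinks to 2x2, which is the reduction in representation size that pooling is used for.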

Once we have applied a series of convolution and pooling operations (either max pooling or average pooling) to the feature representation of the image, we flatten the output of the final pooling layer into a vector and pass it through Fully Connected layers (a Feed-Forward Neural Network) with a varying number of hidden layers to learn the non-linear complexities present in the feature representation.

Finally, the output of the Fully Connected layers is passed through a Softmax layer of the desired size. The Softmax layer outputs a vector of probabilities, one per class, which is used to perform the task of image classification. In the digit recognizer problem, the output softmax layer has 10 neurons to classify the input into one of the 10 classes (digits 0 to 9).
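A minimal sketch of the softmax computation for the 10-class digit case follows. The raw scores are hypothetical stand-ins for the outputs of the last fully connected layer.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the last fully connected layer, one per digit 0-9.
scores = [0.1, 0.3, 2.5, 0.0, -1.2, 0.7, 0.2, -0.5, 1.1, 0.4]

probabilities = softmax(scores)
predicted_digit = probabilities.index(max(probabilities))
print(predicted_digit)  # 2
```

The predicted class is simply the index with the highest probability, here the digit 2.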

In this post, we discussed how the convolution operation works across different inputs, and then went on to discuss some of the primitive methods of image classification leading up to CNNs. After that, we discussed how CNNs work and learned a few important technical aspects of Convolution Neural Networks. Finally, we looked at the reason for attaching MLNs at the end of CNNs: to learn the complex relations needed to solve the problem of image classification.

If you want to learn more about Artificial Neural Networks using Keras & Tensorflow 2.0 (Python or R), check out the Artificial Neural Networks by Abhishek and Pukhraj from Starttechacademy. They explain the fundamentals of deep learning in a simple manner.

In my next post, we will discuss how to visualize the workings of a Convolution Neural Network using PyTorch.

Until then Peace :)

NK.

You can connect with me on LinkedIn or follow me on Twitter for updates about upcoming articles on deep learning and Artificial Intelligence.