Fábio Perez


Reading the VGG Network Paper and Implementing It From Scratch with Keras

There are hundreds of code examples for Keras. It's common to just copy-and-paste code without knowing what's really happening. In this tutorial, you will implement something very simple, but with several learning benefits: you will implement the VGG network with Keras, from scratch, by reading the VGG's* original paper.

* I'm using the term "VGG" to describe the architecture created by VGG (Visual Geometry Group, University of Oxford) for the ILSVRC-2014.

Well, what is the point of implementing something that is already implemented?

The point is to learn. By completing this guide, you will:

  • Learn more about the architecture of VGG;
  • Learn more about convolutional neural networks;
  • Learn more about how to implement networks in Keras;
  • Learn more about the scientific method by reading a scientific paper and implementing part of it.

Why start with VGG?

  • It's easy to implement;
  • It achieved excellent results on the ILSVRC-2014 (ImageNet competition);
  • The paper is nice to read;
  • There is a Keras implementation of it, so you can compare your code.

Target audience

People who are new to Deep Learning and have never implemented any network with Keras.


You should know basic Python and the basics of convolutional neural networks. I recommend that you read the notes from Stanford's CS231n: Convolutional Neural Networks for Visual Recognition.

Exercise 0

Skim through the “VGG” network paper: Very Deep Convolutional Networks for Large-Scale Image Recognition. Take a closer look at the results and the architecture of the network. Based on the results, choose which configuration you want to implement.

Exercise 1

Study the network architecture. Consider the following hyper-parameters: convolutional filter (receptive field) size, stride and padding. Also, check which activation function is used. If you are not sure what these terms mean, check the CS231n class notes. Don't forget to check the size of input data.

Calculate the number of parameters that each layer has to learn. Sum the parameters from each layer to get the total number of learnable parameters. These Convolutional Networks notes from CS231n can help a lot. Don’t forget to sum the biases. Also, calculate the shape (width, height, depth) of the output of each layer. Use pencil and paper! Drawing helps a lot in this exercise.

You can find information on the number of parameters and how to calculate them in the paper.

Exercise 2

Read the first page of the Keras documentation and Getting started with the Keras Sequential model. Skip the Examples section before your first trial*. Go back to the paper and read it with more attention. Focus on the architecture configuration. Start coding your network architecture. You will need to go through the Layers section of Keras documentation.

* Note: I’m only recommending that you skip the examples section so you don’t get “spoilers” from the VGG network. It’s always recommended that you read the examples as you’ll probably learn more from them than from the rest of the documentation.


  • Read section 2 (ConvNet Configurations) of the paper.
  • Don’t forget the Keras imports. For example, if you want to use keras.layers.pooling.MaxPooling2D, import it with: from keras.layers.pooling import MaxPooling2D. This will make the code more readable.
  • If you get stuck, take a look at the examples from the Keras documentation.

Exercise 3

Compare your results with the Keras implementation of VGG. Check if the number of parameters of your network is the same as Keras’. You can use model.summary() to show the number of parameters and the output shape of each layer in your network.

Getting the solutions

For this section, I'll focus more on the process of getting the solutions rather than on the solutions themselves.

Exercise 0

For the first exercise, I did the first thing I do every time I start reading a paper: read the abstract, read the conclusion, then skim through it looking for interesting results (that usually are tables and figures).

The results shown in Tables 3 and 4 indicate that the best network configurations are D and E. The architectures of these configurations are shown in Table 1. Note that you don't need to read the entire paper to find this information as everything you need (for now) could be easily found by taking a quick look at images and tables.

I decided to implement configuration D as it has practically the same performance as configuration E, but with a simpler architecture (16 weight layers instead of 19).

Exercise 1

We want to understand the network architecture. From our first exercise, we know that the different configurations are described in Table 1. From the table caption, we learn that conv3–64 is a convolutional layer with a receptive field of size 3x3 and 64 channels (filters):

The convolutional layer parameters are denoted as "conv⟨receptive field size⟩-⟨number of channels⟩"

However, the table doesn't tell anything regarding the convolutional padding and stride. To find these, we again skim through the paper.

Taking notes on the paper helps you organize your ideas better. It doesn't matter whether you print the paper and use a pen or do it digitally, but always take notes.

Now we know everything about the network architecture:

  • input size: 224 x 224;
  • the receptive field size is 3 x 3;
  • the convolution stride is 1 pixel;
  • the padding is 1 (for receptive field of 3 x 3) so we keep the same spatial resolution;
  • the max pooling is 2 x 2 with stride of 2 pixels;
  • there are two fully connected layers with 4096 units each;
  • the last layer is a softmax classification layer with 1000 units (representing the 1000 ImageNet classes);
  • the activation function is the ReLU

We can now calculate the number of learnable parameters.

You can find this information in Section 2.3 (Discussion).

For the first convolutional layer, the network has to learn 64 filters with size 3x3 along the input depth (3). Plus, each one of the 64 filters has a bias, so the total number of parameters is 64*3*3*3 + 64 = 1792. You can apply the same logic for other convolutional layers.
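The same arithmetic can be written as a small helper. This is only a sketch; the function name conv_params and its argument names are mine, not something from the paper or from Keras.

```python
def conv_params(filter_size, in_depth, n_filters):
    """Learnable parameters of a conv layer with square filters."""
    weights = filter_size * filter_size * in_depth * n_filters
    biases = n_filters  # one bias per filter
    return weights + biases


# First layer: 64 filters of 3x3 over an RGB input (depth 3).
print(conv_params(3, 3, 64))   # 1792
# Second conv3-64 layer: its input depth is now 64.
print(conv_params(3, 64, 64))  # 36928
```

Note that the spatial size of the input (224 x 224) does not appear anywhere: a convolutional layer shares its filters across all positions.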

The depth of a layer’s output will be the number of its convolutional filters. The padding is chosen as 1 pixel so the spatial resolution is preserved through the convolutional layers. Thus, the spatial resolution will only change at the pooling layers. So, the output of the first convolutional layer will be 224 x 224 x 64.

The pooling layer does not learn anything, so we have 0 learnable parameters in it. To calculate the output shape of the pooling layer, we have to consider the size of the window and the stride. Since the window is 2 x 2 and the stride is 2, the layer is outputting a pixel for every 2 x 2 pixels and jumping by 2 pixels to do the next calculation (no overlap occurs), so the spatial resolution is divided by 2 in each pooling layer. The depth remains the same.
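To check the pencil-and-paper shapes, the two rules above can be sketched in plain Python. The helper names and the VGG16_BLOCKS list (number of conv layers and filters per block of configuration D, read from Table 1) are my own notation.

```python
def conv_output(shape, n_filters):
    """3x3 conv, stride 1, padding 1: height and width are preserved."""
    height, width, _ = shape
    return (height, width, n_filters)


def pool_output(shape, window=2, stride=2):
    """Max pooling without padding: depth is preserved."""
    height, width, depth = shape
    return ((height - window) // stride + 1,
            (width - window) // stride + 1,
            depth)


# (number of conv layers, number of filters) for each block of configuration D
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

shape = (224, 224, 3)
for n_convs, n_filters in VGG16_BLOCKS:
    for _ in range(n_convs):
        shape = conv_output(shape, n_filters)
    shape = pool_output(shape)
    print(shape)  # shape after each pooling layer
```

The last line printed is (7, 7, 512), the shape that feeds the fully-connected layers.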

To calculate the number of parameters in the fully-connected layers, we have to multiply the number of units in the previous layer by the number of units in the current layer, and then add one bias per unit of the current layer. By following the logic presented in the previous paragraphs, you can see that the output of the last pooling layer will have 7x7x512 units. So, the total number of parameters in the first fully-connected layer will be 7x7x512x4096 + 4096 = 102764544 for configuration D.

You can compare the total number of parameters with the values in the paper (Table 2), which lists 138 million parameters for configuration D.
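Putting the layer-by-layer logic together, a short script can sum everything for configuration D. This is a sketch under my own conventions: the list CONFIG_D uses integers for conv3 layers and "M" for max pooling, and the function name is hypothetical.

```python
# Configuration D from Table 1: integers are conv3 filter counts, "M" is max pooling.
CONFIG_D = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
            512, 512, 512, "M", 512, 512, 512, "M"]


def count_parameters(config, input_shape=(224, 224, 3),
                     fc_units=(4096, 4096, 1000)):
    height, width, depth = input_shape
    total = 0
    for item in config:
        if item == "M":
            # 2x2 pooling with stride 2 halves the spatial resolution.
            height, width = height // 2, width // 2
        else:
            # 3x3 filters over the current depth, plus one bias per filter.
            total += 3 * 3 * depth * item + item
            depth = item
    units = height * width * depth  # flattened input of the first dense layer
    for n in fc_units:
        total += units * n + n
        units = n
    return total


print(count_parameters(CONFIG_D))  # 138357544
```

That is roughly 138 million parameters, most of them in the first fully-connected layer.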

Exercise 2

On the first page of the Keras documentation, you can find that you will need to create a Sequential model:

from keras.models import Sequential
model = Sequential()

Then you add layers with model.add(). An alternative is to pass a list of layers to the Sequential constructor (I used this method).
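The two styles can be compared on a toy network. The layer sizes here are arbitrary and have nothing to do with VGG; the sketch assumes a Keras version in which layers accept the input_shape keyword argument.

```python
from keras.models import Sequential
from keras.layers import Dense

# Option 1: create an empty model and add the layers one by one.
model_a = Sequential()
model_a.add(Dense(8, activation="relu", input_shape=(4,)))
model_a.add(Dense(2, activation="softmax"))

# Option 2: pass the list of layers to the constructor.
model_b = Sequential([
    Dense(8, activation="relu", input_shape=(4,)),
    Dense(2, activation="softmax"),
])

# Both models have the same architecture: (4*8 + 8) + (8*2 + 2) parameters.
print(model_a.count_params(), model_b.count_params())
```

Which one you choose is mostly a matter of taste; the list form keeps the whole architecture visible in one place.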

The hardest part is to define the exact parameters for each layer. This can be done by looking at the documentation: Convolutional, Pooling, and Core layers.

We first define which layers we will use. Since the VGG network works with images, we will use Conv2D and MaxPooling2D. It's important to read the entire documentation on these layer types.

For the Conv2D layers, the first thing we note is that:

When using this layer as the first layer in a model, provide the keyword argument input_shape (tuple of integers, does not include the sample axis), e.g. input_shape=(128, 128, 3) for 128x128 RGB pictures in data_format="channels_last".

So, we have to define the input_shape of the images. From exercise 1, we noted that the input size is 224x224. We are working with color images, so the depth of our input is 3.

By reading the Conv2D arguments, we learn how to define the size of the kernels, the stride, the padding and the activation function.

An important argument to note is the data_format. It defines whether the channels axis comes last (height, width, channels) or first (channels, height, width) in the data that flows through Keras. Since I don't want to set this argument for every program in Keras, I edit the ~/.keras/keras.json file to set a default:

"image_data_format": "channels_last"
# (...) other configs
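If you want to confirm which default is active, you can ask the Keras backend. A minimal sketch, where the variable name input_shape is mine:

```python
from keras import backend as K

# Reads the default set in keras.json (or the built-in default if no file exists).
data_format = K.image_data_format()
print(data_format)

# Choose the matching input shape for 224x224 RGB images.
if data_format == "channels_last":
    input_shape = (224, 224, 3)
else:  # "channels_first"
    input_shape = (3, 224, 224)
print(input_shape)
```

The rest of this tutorial assumes channels_last.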

The arguments that we need to use are filters, kernel_size, strides, padding, and activation. Some other arguments can be useful if you intend to train the model later, such as kernel_initializer.

Setting the filters, the kernel size and the stride is trivial. Setting the activation function requires that you go to the Activations documentation.

There are two options for the padding: valid or same. It's not clear what they mean, so I had to Google it. I found this and this. An alternative would be to look directly in the Keras implementation.

With border mode “valid” you get an output that is smaller than the input because the convolution is only computed where the input and the filter fully overlap.
With border mode “same” you get an output that is the “same” size as the input. That means that the filter has to go outside the bounds of the input by “filter size / 2” — the area outside of the input is normally padded with zeros.

So, we want the padding to be set as same.
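The effect of the two modes on the output size along one spatial dimension can be sketched with the usual formulas. The function below is a hypothetical helper, not part of Keras.

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial dimension."""
    if padding == "valid":
        # The filter must fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        # The input is zero-padded so that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")


print(conv_output_size(224, 3, padding="valid"))  # 222
print(conv_output_size(224, 3, padding="same"))   # 224
```

With valid, every 3x3 convolution would shave 2 pixels off each dimension, which is not what the paper describes.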

For the MaxPooling2D layer, we need to set the pooling size, the stride and the padding. Since we want no padding, we just use padding="valid". As valid is the default value for padding, we can omit this argument (but note that the Keras API changes often, so relying on a default could silently change the architecture in future versions of Keras).

Before we add the fully-connected (Dense) layers, we need to flatten the output of the last pooling layer. The Flatten layer just reshapes the 3D output of the convolutional part of the network to 1D.

Finally, for the Dense layers, we just have to set the number of units and the activation function.

The final code is very concise:
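One way to write configuration D along the lines described above is sketched here. It assumes the channels_last data format and a Keras version in which layers accept input_shape; the helpers conv and pool are my own shorthand, not Keras functions.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense


def conv(filters, **kwargs):
    # 3x3 receptive field, stride 1, "same" padding, ReLU activation
    return Conv2D(filters, kernel_size=(3, 3), strides=(1, 1),
                  padding="same", activation="relu", **kwargs)


def pool():
    # 2x2 max pooling with stride 2 (padding defaults to "valid")
    return MaxPooling2D(pool_size=(2, 2), strides=(2, 2))


model = Sequential([
    conv(64, input_shape=(224, 224, 3)), conv(64), pool(),
    conv(128), conv(128), pool(),
    conv(256), conv(256), conv(256), pool(),
    conv(512), conv(512), conv(512), pool(),
    conv(512), conv(512), conv(512), pool(),
    Flatten(),
    Dense(4096, activation="relu"),
    Dense(4096, activation="relu"),
    Dense(1000, activation="softmax"),
])

model.summary()
```

Building the model allocates all of its weights, so expect it to take a few hundred megabytes of memory.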

Note that I didn't include the Dropout layers, nor did I set the weight initializers, as we are still not interested in the training step.

You can find the Keras implementation of VGG here.

Exercise 3

You can check the VGG16 or VGG19 architecture by running:

from keras.applications import VGG16, VGG19
VGG16(weights=None).summary()  # weights=None skips downloading the pre-trained weights

Go beyond

An interesting next step would be to train the VGG16. However, training on ImageNet is a much more complicated task. The VGG paper states that:

On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.

That's a lot of time, even if you have a setup worth thousands of dollars.

Nonetheless, an interesting exercise would be to try to see how you would reproduce some interesting aspects of the training and the testing setup, such as including the Dropout layers, setting the optimizer, compiling the model, playing with pre-processing etc.
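As a starting point, here is a rough sketch of the classifier part with Dropout, compiled with SGD. The paper uses a dropout ratio of 0.5 on the first two fully-connected layers, a momentum of 0.9 and an initial learning rate of 0.01; everything else is my assumption, including the function name, the reduced number of units (so the example builds quickly) and a Keras version whose SGD accepts learning_rate. Weight decay and the learning rate schedule are left out.

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.optimizers import SGD


def build_classifier_head(input_shape=(7, 7, 512), units=4096, classes=1000):
    """Fully-connected part of VGG with Dropout after the first two layers."""
    return Sequential([
        Flatten(input_shape=input_shape),
        Dense(units, activation="relu"),
        Dropout(0.5),
        Dense(units, activation="relu"),
        Dropout(0.5),
        Dense(classes, activation="softmax"),
    ])


# units=64 only keeps this sketch light; the paper uses 4096.
head = build_classifier_head(units=64)
head.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
             loss="categorical_crossentropy",
             metrics=["accuracy"])
head.summary()
```

Dropout layers have no learnable parameters, so they do not change the parameter count from the exercises.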

You can also try to fine-tune VGG by using pre-trained weights. I intend to create a tutorial on this topic soon.

Plus, you can read the Inception network paper and try to implement it.
