Using Neural Style Transfer to Transform Donald Trump’s “New York Skyline” into Art

Written by hackernoon-archives | Published 2017/10/27
Tech Story Tags: art | machine-learning | donald-trump | new-york-skyline | neural-style-transfer


Opinions on Donald Trump vary widely. However, after seeing his infamous drawing of the New York skyline, "he might be a great artist" is probably not one of them.

On the other hand, it has been shown by some researchers from Tübingen that neural networks can be used to create art. In this article, we are going to learn how this approach works both in theory and in practice. Along the way, we will not only see some fascinating pictures combining the style of famous artists with photographs of Donald Trump in various situations, but we will also find out what Trump’s depiction of the New York skyline would have looked like if he had been working together with Pablo Picasso!

If you would like to see the complete code of this project, please refer to my GitHub.

Together with Mrs. Merkel and Mrs. May at a Starry Night

Convolutional Neural Networks

To understand how style transfer with neural networks is accomplished, it’s necessary to understand some basics of a special kind of neural net: the so-called Convolutional Neural Network (CNN). CNNs are the type of neural network most often applied to tasks involving image data. The reason is that CNNs perform an operation that is very useful when dealing with images: each convolutional layer is able to learn a set of filters.

Fig. 1: Horizontal Sobel Operator

In earlier decades, filters for processing image data were developed exclusively by hand. Filters like these are still used in common Computer Vision tasks, e.g. in robotics, and the operations they perform are usually relatively straightforward. For example, there exist multiple hand-crafted filters for edge detection in images. One of them is the horizontal Sobel operator (Figure 1), which responds to horizontal changes in brightness, i.e. to vertical edges. To understand how this filter works on greyscale images, we can imagine it being slid over the image it is applied to (although this is an oversimplification). At every step, the values of the nine image pixels covered by the current filter position are multiplied with the filter values, and the results are summed up.

Now what happens if there is no edge at the filter’s location? The grey values of all nine image pixels will be relatively similar, so the resulting value of the operator will be ≈0. But imagine the filter is applied to Vincent van Gogh’s Starry Night, and it partially covers a lighter part of the sky (high grey value) to the left and one of the dark trees (low grey value) to the right. Let’s say the sky’s pixels all have the value 100 and the tree’s pixels have the value 10. The resulting value of the Sobel operator will then be 360 in magnitude, which is far from 0 and indicates that we have found an edge.
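
As a quick sanity check of this arithmetic, here is a tiny NumPy sketch (not part of the project code) that applies one common form of the horizontal Sobel kernel to such a sky/tree patch:

```python
import numpy as np

# One common sign convention for the horizontal Sobel kernel
# (the mirrored version simply flips the sign of the result).
sobel = np.array([[1, 0, -1],
                  [2, 0, -2],
                  [1, 0, -1]])

# Hypothetical 3x3 greyscale patch: bright sky (100) on the left,
# a dark tree (10) on the right; the middle column is multiplied by 0 anyway.
patch = np.array([[100, 50, 10],
                  [100, 50, 10],
                  [100, 50, 10]])

print(np.sum(sobel * patch))   # 360 -> strong response, an edge

flat = np.full((3, 3), 100)    # uniform region
print(np.sum(sobel * flat))    # 0 -> no edge
```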

Convolutional networks are able to learn filters performing operations of this kind all by themselves. The learnt filters identify features that help make sense of the content of an input image. A low-level convolutional layer might not only identify edges but also other simple structures like curves. Even better, it’s possible to stack convolutional layers, which allows the network to learn more complex features in the higher convolutional layers.

Visualization of different filters learnt by the VGG-19 network, taken from lower (left) to higher (right) convolutional layers. (blog.keras.io)

Artistic Style Transfer with Convolutional Neural Networks

This means that it’s possible to capture the overall structure of an image by examining how different learnt filters in the lower and middle convolutional layers of the neural network react to the input. The filters of the higher layers are less concerned with specific low-level patterns, instead they are able to capture the large scale structure.

In other words: We can examine the style with the use of the lower and middle layers of our net, and we can examine the content with the use of a higher layer.

Combination with textile art from the Kasaï province, Democratic Republic of Congo

Why do we build on the VGG-19 network?

The VGG-19 network is a deep convolutional neural net designed by a team at the University of Oxford. It has been trained to classify the objects in the ImageNet dataset, covering all kinds of real-world objects like dogs, chairs, people, hammers and cars, with extremely high accuracy. The network has been confronted with a huge amount of data and has already learnt highly useful features in its convolutional layers. Since the trained weights have kindly been made publicly available, we can reuse these features for our own purposes.

But how exactly are we going to approach the problem of transferring one image’s style to another image?

How neural style transfer works

When training a neural network, we first have to decide on the loss function we want to use. This loss function is the means by which we tell our network which objective we want it to achieve.

Take a network that is being trained to distinguish dogs from cats: our network’s loss should be low if the network is sure that it has seen a cat when it really has seen a cat, and it should be high when the network has seen a dog but is sure it has seen a cat. During training, the neural network adjusts the calculations performed in its layers to become better at its task.
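
As a tiny, made-up illustration (cross-entropy is one common choice for such a classification loss; the probabilities below are invented):

```python
import numpy as np

def cross_entropy(predicted_prob_cat, truly_a_cat):
    """Loss for a single cat-vs-dog prediction."""
    p = predicted_prob_cat if truly_a_cat else 1.0 - predicted_prob_cat
    return -np.log(p)

# Confidently right: 95% sure it's a cat, and it is a cat -> low loss (~0.05).
print(cross_entropy(0.95, truly_a_cat=True))

# Confidently wrong: 95% sure it's a cat, but it is a dog -> high loss (~3.0).
print(cross_entropy(0.95, truly_a_cat=False))
```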

When performing style transfer, we take a different approach: the network itself stays completely the same during the whole training process. In fact, with only very limited training data and computational resources, it’s very unlikely that we could considerably improve the VGG-19 network anyway. Instead, we allow the optimization to transform the input image itself, until the loss functions we constrain it with are satisfied. What’s still missing is a definition of these functions.

Drawing style of Hieronymus Bosch

Defining the loss functions for style transfer

Content Loss

We will have to define three cost functions; our network’s objective will then be to reduce the sum of their values. The first of these is the content loss. We take the original content image and one of the higher convolutional layers and record how that layer responds to it (P). F denotes how the same layer responds to the currently generated image. We want the two responses to be similar, so we take the squared difference as our loss. Using this cost function, the network keeps the content representation of the generated image similar to the content represented in the original image.
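
For reference, this is the content loss as written in the original paper, where l is the chosen layer and i, j index the layer’s filters and the positions within their feature maps:

```latex
L_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}
```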

Style Loss

The second cost function we will use is the style loss. As mentioned, for measuring style similarity we take the responses of multiple network layers into account. For each layer, we get a cost value E_l. To obtain the style loss, we combine these values in a weighted sum. The authors of the original paper assigned the same weight to every chosen layer, but it’s possible to play around with these values to achieve unique results.

To obtain the style loss E_l of a single layer, the authors calculate and compare the Gram matrices of that layer’s responses to the original style image (A) and to the currently generated image (G). Intuitively, the Gram matrix contains values that describe how strongly pairs of features learnt by the convolutional layer tend to appear together in an image. One benefit of comparing Gram matrices is that the comparison is independent of locations in the image, which fits the assumption that the style of an image does not differ between its regions.
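
For reference, the corresponding formulas from the paper: the Gram matrix of layer l, the per-layer cost E_l (with N_l the number of filters and M_l the size of each feature map), and the weighted sum that yields the style loss:

```latex
G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}

E_{l} = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}

L_{style}(\vec{a}, \vec{x}) = \sum_{l} w_l E_l
```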

Total loss as defined in the original paper

The total loss as defined by the authors is a weighted sum of the content loss and the style loss. Usually, β is chosen much larger than α (β ≈ 100·α). To reduce visual noise in the generated image, we will also add a variation cost to the total loss in our implementation.
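
In the paper’s notation, with p the content photograph, a the style artwork and x the generated image:

```latex
L_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \, L_{content}(\vec{p}, \vec{x}) + \beta \, L_{style}(\vec{a}, \vec{x})
```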

19th century Chinese art by Guan Lianchang chosen as style image. If you look closely, I claim you’ll be able to spot typically Chinese architectural elements.

Implementing neural style transfer with Tensorflow

As mentioned, you can find the complete style transfer code on my GitHub.

I first downloaded the trained VGG-19 network from vlfeat.org. I rebuilt the model in TensorFlow, keeping only the convolutional layers, since the layers at the end of the network are only used for classification tasks, which we are not concerned with here. As proposed by the authors, I replaced the max pooling layers of the VGG network with average pooling layers to achieve better results.

(Don’t worry if you are not familiar with the terminology. Pooling layers combine the values of neighbouring neurons in a convolutional layer, effectively reducing the number of neurons needed in the next convolutional layer.)
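
As a sketch of the two building blocks (assuming the pretrained kernel weights and biases have already been loaded and reshaped from the downloaded .mat file; the helper names are my own, not necessarily those in the repository):

```python
import tensorflow as tf

def conv_layer(prev, kernel_weights, bias):
    """One VGG convolution with fixed, pretrained weights, followed by ReLU."""
    conv = tf.nn.conv2d(prev, tf.constant(kernel_weights),
                        strides=[1, 1, 1, 1], padding='SAME')
    return tf.nn.relu(conv + tf.constant(bias))

def pool_layer(prev):
    """Average pooling in place of VGG's original max pooling."""
    return tf.nn.avg_pool(prev, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')

# Stacking these in the VGG-19 order (conv1_1, conv1_2, pool, conv2_1, ...)
# on top of the input variable defined below yields a dictionary `model`
# mapping layer names to their output tensors.
```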

Applying Hieronymus Bosch’s style can lead to our network having pretty scary hallucinations.

I added an input variable to the network. Usually when dealing with neural networks, we have a placeholder for our inputs, which we use to feed different objects to the network at each training iteration. In our case, we will not feed new data between consecutive training steps; instead, the input is a variable that the network is given permission to modify itself.

<a href="https://medium.com/media/10cfafcf86cbe73e9a055ba392e28cc8/href">https://medium.com/media/10cfafcf86cbe73e9a055ba392e28cc8/href</a>

Also, I had to define functions for preprocessing images before providing them to the VGG network. Image values are centred by subtracting the mean. For resizing, I used SciPy’s imresize. To be able to apply convolutions, a new axis has to be introduced, so the image array ends up with 4 dimensions, where the first dimension’s size is equal to the number of images (which is 1).

<a href="https://medium.com/media/f6bc30c25d522e84ebd228fbe7fc2890/href">https://medium.com/media/f6bc30c25d522e84ebd228fbe7fc2890/href</a>

Now, I decided on the layers I wanted to use for extracting the content and style features. I chose the same style layers as the authors did, but found that a different content layer provided better final results. However, you may want to play around with these values if you are going to implement style transfer yourself. Also, make sure to change the layer weights according to your needs: higher weights for the higher style layers will lead to more complex artistic features being transferred to your image (see the Trump-Bosch combination above). I saved the responses of the chosen layers when feeding the preprocessed content and style images to the VGG-19 net.

<a href="https://medium.com/media/b51507aeab1b4d514ac14c4e4d101f0a/href">https://medium.com/media/b51507aeab1b4d514ac14c4e4d101f0a/href</a>

Combination with 17th century Indian art

Now, I implemented the mentioned loss functions:

<a href="https://medium.com/media/a1dc9453618b0c3675352e59c7a7c6ed/href">https://medium.com/media/a1dc9453618b0c3675352e59c7a7c6ed/href</a>

… and I set up everything for being able to perform cost optimization.

<a href="https://medium.com/media/b6be51384f162060fba99d4622bd1b23/href">https://medium.com/media/b6be51384f162060fba99d4622bd1b23/href</a>

As network input, the authors of the paper proposed using white noise. However, to make it easier for the network to recover the desired content, I provided the net with a noisy version of the content image as input.

<a href="https://medium.com/media/814f0f90c4cef8eaef3b269b0d44b916/href">https://medium.com/media/814f0f90c4cef8eaef3b269b0d44b916/href</a>

Finally, I trained the network for each of the content/style combinations depicted above. Often, I was not satisfied with the initial results and had to adjust the available parameters, especially the layer weights, the loss weights α and β, and the input image noise ratio. I performed the training on a powerful GPU at AWS, but if you have a few hours to spare, you will probably be able to create fascinating artwork even on your local CPU. Usually, you will find out quickly, within about 200 iterations, whether the network is able to create a visually appealing result. The remaining iterations improve the colors and refine the details. The images depicted in this article were the results of about 2000–3000 iterations each.
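
Putting the previous snippets together, the training loop itself is short. This sketch saves an intermediate result every few hundred iterations so you can judge early whether a combination works:

```python
ITERATIONS = 2500  # the images in this article took roughly 2000-3000 iterations

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(input_image.assign(initial_image))

    for i in range(ITERATIONS):
        sess.run(train_step)
        if i % 200 == 0:
            print('iteration %d, total loss %.0f' % (i, sess.run(total_loss)))
            result = postprocess(sess.run(input_image))
            scipy.misc.imsave('output_%04d.jpg' % i, result)
```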

And what about Trump’s New York Skyline?

Trump’s original sketch was sold for $30,000. How much, then, would this masterpiece have fetched?

Applying the style of Picasso to Trump’s drawing. (Original content image by Nate D. Sanders Auctions)

