Deep learning has supercharged ‘cognitive’ tasks such as vision and language processing. Even Google has switched to neural-network-based translation. One possible reason for its success is that it does not require domain-specific knowledge to obtain state-of-the-art results. Also, massively parallel hardware like GPUs, coupled with well-designed frameworks like TensorFlow, has helped fuel the AI revolution. This post talks about another such ‘cognitive’ task: colouring black-and-white photos using deep learning.
This article is based on a fairly recent paper: https://arxiv.org/pdf/1603.08511.pdf
It assumes basic knowledge of neural networks and loss functions.
The task is fairly simple to state: take a black-and-white photo and produce a coloured version of it. Intuitively, the idea is straightforward: depending on what is in the picture, it is often possible to tell what the colour should be. The leaves of trees are generally green, the sky is blue, clouds are white, and so on. All that remains is to get a computer to do the same.
Previous works have also used deep learning for this task, typically with regression to predict the colour of each pixel, trained with a Mean Squared Error (MSE) loss. This, however, produces fairly bland and dull results. The authors noted that MSE encourages the model to ‘average out’ the plausible colours in order to minimise the error, which results in a washed-out look. They instead pose colourisation as a classification problem.
The authors used the LAB colour space (rather than the more common RGB). In the LAB scheme, the L channel records the light intensity, and the other two channels record the colour opponents green–red (a) and blue–yellow (b), respectively.
One good reason to use the LAB colour space is that it keeps the light intensity values separate. A B/W picture can be considered to be just the L channel, so the model won’t have to learn how to keep light intensities right when it makes predictions (it would have to if RGB were used). The model only has to learn how to colour images, allowing it to focus on what matters.
The model outputs the AB values, which are then combined with the original L channel (the B/W image) to produce the coloured version.
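To make the channel split concrete, here is a minimal sRGB-to-LAB conversion in plain NumPy; this is a sketch using the standard D65 formulas, and in practice you would use a library routine such as scikit-image’s `rgb2lab`:

```python
import numpy as np

# Minimal sRGB -> CIELAB conversion (D65 white point), just to show how
# the lightness channel L separates from the colour channels a and b.
def rgb_to_lab(rgb):
    # 1. Undo the sRGB gamma curve.
    c = np.where(rgb > 0.04045, ((rgb + 0.055) / 1.055) ** 2.4, rgb / 12.92)
    # 2. Linear RGB -> XYZ.
    M = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    xyz = c @ M.T
    # 3. Normalise by the D65 white point, apply the Lab nonlinearity.
    xyz = xyz / np.array([0.95047, 1.0, 1.08883])
    f = np.where(xyz > 0.008856, np.cbrt(xyz), 7.787 * xyz + 16 / 116)
    L = 116 * f[..., 1] - 16            # lightness: all a B/W photo gives us
    a = 500 * (f[..., 0] - f[..., 1])   # green-red opponent
    b = 200 * (f[..., 1] - f[..., 2])   # blue-yellow opponent
    return np.stack([L, a, b], axis=-1)

white = np.array([[[1.0, 1.0, 1.0]]])   # a 1x1 pure-white "image"
print(np.round(rgb_to_lab(white), 2))   # white has L near 100, a and b near 0
```

The model sees only the L plane as input and predicts the a/b planes.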
The model itself is a fairly standard convolutional neural network. The authors did not use any pooling layers, and instead chose to use upsampling/downsampling layers.
As briefly mentioned above, the authors used a classification model instead of a regression one, so the number of classes needs to be fixed. They quantised the AB space into 313 bins and used those as the classes. Even though this may seem like a very low number, they used methods to ensure more colour values are possible (which will be discussed later in this post).
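A toy sketch of the quantisation idea. The bin width of 10 matches the paper; the flat 22×22 grid indexing and the ab value range are my simplification, since the paper keeps only the 313 bins that are in-gamut for natural images:

```python
import numpy as np

BIN = 10      # bin width in ab units, as in the paper
A_MIN = -110  # assumed lower bound of the ab range (my simplification)
GRID = 22     # 22 bins per axis covers [-110, 110)

def ab_to_bin(a, b):
    """Map a continuous (a, b) pair to a single integer bin index."""
    ai = int((a - A_MIN) // BIN)
    bi = int((b - A_MIN) // BIN)
    return ai * GRID + bi

# The neutral colour (a=0, b=0) lands in the centre of the grid.
print(ab_to_bin(0.0, 0.0))  # 11 * 22 + 11 = 253
```

Each pixel’s ground-truth colour becomes a class label like this, and the network predicts a distribution over the bins.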
The loss function that the authors used was the standard cross entropy, where Z is the actual class of a pixel and Z hat is the output of the model.
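The equation image did not carry over here; reconstructed in the paper’s notation (with h, w indexing pixels and q indexing the colour classes), the cross-entropy summed over all pixels is:

```latex
L_{cl}(\hat{Z}, Z) = -\sum_{h,w} \sum_{q} Z_{h,w,q} \log \hat{Z}_{h,w,q}
```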
The authors also noted that there would be class imbalance among the colour values. Cross entropy does not play very well with class imbalance, and the usual remedy is to give classes with fewer examples a higher weight. The authors noted that desaturated colours like grey and light blue are abundant compared to others because of their appearance in backgrounds. Therefore they came up with their own weighting scheme.
The authors calculate ~p, the empirical distribution of the classes, from the ImageNet database. Remember that Q is the number of classes (313). The authors found that a λ value of 0.5 worked well. Note that they also smooth the distribution ~p, but I will skip the details here; if you are interested, you can read about it in the original paper.
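A sketch of the rebalancing weights, w ∝ ((1−λ)·~p + λ/Q)⁻¹, normalised so that the expected weight under ~p is 1. The distribution below is random stand-in data, not the real ImageNet statistics:

```python
import numpy as np

Q = 313    # number of colour classes
lam = 0.5  # the lambda value the authors found to work well

# Hypothetical empirical class distribution ~p (random for illustration;
# the paper estimates it from ab values in ImageNet, then smooths it).
rng = np.random.default_rng(0)
p_tilde = rng.random(Q)
p_tilde /= p_tilde.sum()

# Mix the empirical distribution with a uniform one, then invert:
# rare colours get large weights, common colours get small ones.
w = 1.0 / ((1 - lam) * p_tilde + lam / Q)
# Normalise so the expected weight under p_tilde equals 1.
w /= (p_tilde * w).sum()

print(round(w.min(), 3), round(w.max(), 3))
```

Blending with the uniform distribution (the λ/Q term) keeps the weights for very rare colours from blowing up.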
So after taking the weights into account, the final loss function looks like this:
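Reconstructed from the paper, it is the cross entropy from before with the per-pixel weight v applied:

```latex
L_{cl}(\hat{Z}, Z) = -\sum_{h,w} v(Z_{h,w}) \sum_{q} Z_{h,w,q} \log \hat{Z}_{h,w,q}
```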
The new term v() is just the value of the weight for each class; h and w index the height and width of the image, respectively.
Using the 313 classes directly to colour images would be too coarse: there are simply too few colours to realistically represent the true range of colours in natural images.
The authors used a post processing step in order to get a more diverse colour range from the model’s predictions.
H is a function of Z, the output of the model, and T is a temperature hyper-parameter for which the authors experimented with a few different values.
This is a good step because the model’s output contains very valuable information about the class probabilities. Instead of just taking the class with maximum probability (as we do in image classification), the function above utilises the information present in the entire probability distribution of the model’s output.
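A sketch of this ‘annealed mean’ in NumPy: sharpen the predicted distribution with temperature T, then take its expectation over the bin centres. The bin centres below are made up for illustration; the paper reports that T ≈ 0.38 works well:

```python
import numpy as np

def annealed_mean(probs, bin_centers, T=0.38):
    """Sharpen a distribution over colour bins with temperature T, then
    take its expectation. T -> 0 approaches the argmax colour; T = 1
    gives the plain (bland, averaged) mean."""
    logits = np.log(probs + 1e-12) / T
    logits -= logits.max()      # subtract max for numerical stability
    q = np.exp(logits)
    q /= q.sum()
    return q @ bin_centers      # expected ab value under the sharpened dist.

# Toy example: 3 bins with hypothetical ab-space centres.
probs = np.array([0.6, 0.3, 0.1])
centers = np.array([[10.0, -20.0], [40.0, 5.0], [-30.0, 60.0]])
print(annealed_mean(probs, centers))
```

Intermediate values of T trade off between the vibrancy of the argmax and the spatial consistency of the mean, which is exactly the knob the authors tune.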
Training such a network is divided into two parts. First, the data is passed through the model (forward pass) and the final prediction is calculated. Then, to calculate the loss, the ground-truth colours are encoded into the class space, i.e. the inverse of H is applied.
The results are much more vibrant and, in most cases, quite close to real. Notice that the output is often not exactly the same as the ground truth, but it is still semantically correct (the model colours the right objects with the right colours).
In this article, we discussed a novel way to colourise images using a modified loss function. We talked about how vibrancy can be controlled using a hyper-parameter, and why class rebalancing plays an important role in colourising natural images.
If you liked this article, hold that clap icon for as long as you think this article is worth it. I am always looking for feedback to improve my articles. If you have suggestions or questions, feel free to respond.