CEO of Beltrix Arts, AI engineer and Consultant
In this article and the following, we will take a close look at two computer vision subfields: Image Segmentation and Image Super-Resolution. Two very fascinating fields.
Two years ago, after finishing the Andrew Ng course, I came across one of the most interesting papers I had read on segmentation at the time: BiSeNet (Bilateral Segmentation Network). It became the starting point for this blog, which grew because many of you, my viewers, were also fascinated by semantic segmentation.
The more we understand something, the less complicated it becomes.
I did my best at the time to code the architecture but, to be honest, I knew little back then about how to preprocess the data and train the model; there were a lot of gaps in my knowledge. I understood semantic segmentation at a high level but not at a low level.
Real knowledge is to know the extent of one’s ignorance. – Confucius
Fig 1: These are the outputs from my attempts at recreating BiSeNet using TF Keras from 2 years ago 😂. A true work of art!!!
Pretty amazing, aren’t they? I knew this was just the beginning of my journey and that eventually I would make it work if I didn’t give up, or perhaps I would use the model to produce abstract art.
With that said, this is a revised version of that article, which I have been working on recently thanks to the FastAI 18 course.
We change from inputting an image and getting a categorical output to having images as both input and output. This is done by cutting the classification head and replacing it with an upsampling path (this type of architecture is called a fully convolutional network).
Don’t worry if you don’t understand it yet, bear with me.
Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image. It can be applied to a wide range of applications, such as collection style transfer, object transfiguration, season transfer and photo enhancement.
Semantic segmentation is an essential area of computer vision research for image analysis tasks. Its main goal is to assign a semantic label (car, house, person…) to each pixel in an image.
Fig 2: Credits to Jeremy Jordan’s blog
Here the output of the network is a segmentation mask of size (Height x Width x Classes), where Classes is the total number of classes. For the image below, we could say 128 x 128 x 7, where the 7 classes are tree, fence, road, bicycle, person, car and building.
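To make the (Height x Width x Classes) shape concrete, here is a small NumPy sketch that turns a mask of class indices into one binary channel per class. The random mask and the helper name `one_hot_mask` are placeholders of mine, not part of any dataset or library.

```python
import numpy as np

# Class list taken from the example above; the index order is an assumption.
CLASSES = ["tree", "fence", "road", "bicycle", "person", "car", "building"]

def one_hot_mask(label_mask: np.ndarray, num_classes: int) -> np.ndarray:
    """Turn an (H, W) array of class indices into an (H, W, num_classes) binary mask."""
    # Indexing an identity matrix with the label array picks one one-hot row per pixel.
    return np.eye(num_classes, dtype=np.uint8)[label_mask]

# A random 128 x 128 label mask standing in for a real annotation.
rng = np.random.default_rng(0)
labels = rng.integers(0, len(CLASSES), size=(128, 128))

mask = one_hot_mask(labels, len(CLASSES))
print(mask.shape)  # (128, 128, 7)
```

Every pixel has exactly one channel set to 1, so taking the argmax over the last axis recovers the original class indices.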
Just for reference, in a normal Convolutional Neural Network (ConvNet) we have an image as input, and after a series of transformations the ConvNet outputs a vector of C classes, 4 bounding box values, N pose estimation points, or sometimes a combination of them.
Fig 4: Here is an example of a ConvNet that does classification.
The easiest and simplest way of creating a ConvNet architecture to do segmentation is to take a model pretrained on ImageNet, cut the classifier head and replace it with a custom head that takes the small feature map and upsamples it back to the original size (H x W). It’s not the best method, but it works OK.
Now, remember, as we saw above, the input image has the shape (H x W x 3) and the output image (segmentation mask) must have the shape (H x W x C), where C is the total number of classes. I will explain why this is important.
Fig 6: Here is an example from CAMVID dataset
The model we are going to use is ResNet-34. It downsamples the image 5 times, by a factor of 32 overall, for example from (224 x 224 x 3) to a (7 x 7 x 512) feature space (a 128 x 128 input would end up at 4 x 4 x 512). This saves computation because most of the work is done on a small feature map instead of a large image. We cut the ResNet-34 classification head and replace it with an upsampling path of 5 transposed convolutions, which reverse the spatial downsampling of a convolution (they restore the shape, not the original values). Each one is followed by ReLU and BatchNorm layers, except the last one.
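A minimal PyTorch sketch of such an upsampling head might look like the following. It assumes a 7 x 7 x 512 feature map coming out of the backbone; the channel schedule (256, 128, 64, 32), the 2x2 kernels and the function name `upsampling_head` are my own choices for illustration, not the exact values from the notebook. Note that PyTorch puts channels first, so the shapes read (N, C, H, W).

```python
import torch
import torch.nn as nn

def upsampling_head(in_channels: int = 512, num_classes: int = 7) -> nn.Sequential:
    """Five stride-2 transposed convolutions; each one doubles height and width."""
    widths = [in_channels, 256, 128, 64, 32]  # assumed channel schedule
    layers = []
    for c_in, c_out in zip(widths[:-1], widths[1:]):
        layers += [
            nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(c_out),
        ]
    # Last layer: no ReLU/BatchNorm, so it emits raw per-class scores (logits).
    layers.append(nn.ConvTranspose2d(widths[-1], num_classes, kernel_size=2, stride=2))
    return nn.Sequential(*layers)

head = upsampling_head()
features = torch.randn(2, 512, 7, 7)  # stand-in for the output of the ResNet-34 body
logits = head(features)
print(logits.shape)  # torch.Size([2, 7, 224, 224])
```

Five doublings take 7 to 14, 28, 56, 112 and finally 224, which undoes the 32x downsampling of the backbone.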
Fig 7. Upsampling path
The need for transposed convolutions (also called deconvolution) generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input. – A Guide To Convolution Arithmetic For Deep Learning, 2016.
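The "opposite direction" idea is easiest to see in the shape arithmetic. The two helpers below are a sketch using the standard output-size formulas for a convolution and a transposed convolution (no dilation); the function names are mine.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a regular convolution (dilation = 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size: int, kernel: int, stride: int = 1,
                       padding: int = 0, output_padding: int = 0) -> int:
    """Spatial output size of a transposed convolution (dilation = 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# A stride-2 convolution halves a 128-pixel side...
down = conv_out(128, kernel=3, stride=2, padding=1)
# ...and the matching transposed convolution brings it back to 128.
up = conv_transpose_out(down, kernel=3, stride=2, padding=1, output_padding=1)
print(down, up)  # 64 128
```

Only the shape is recovered this way; the transposed convolution has its own learned weights and does not reproduce the original pixel values.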
Something interesting happened during my testing. I’m not fully sure whether it is the new PyTorch v1 or fastai v1, but previously, for multi-class segmentation tasks, you could have your model output an image of size (H x W x 1), because, as you can see in Fig 6, the segmentation mask has the shape (960 x 720 x 1) and contains pixel values ranging from 0 to Classes-1. With PyTorch v1 or fastai v1, your model must output something like (960 x 720 x Classes), otherwise the loss functions (nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss(), etc.) won’t work: you get a CUDA device-side assert error on GPU and a size mismatch on CPU.
Fig 8. output tensor
The only case where I found outputting (H x W x 1) helpful was when doing segmentation on a mask with 2 classes, where you have an object and background.
This happens because the loss functions now essentially one-hot encode the target image (segmentation mask) along the channel dimension, creating a binary matrix (pixel values of 0 or 1) for each possible class, and compare it class by class with the output of the model. If that output doesn’t have the proper shape (H x W x C), you get an error.
This setup is only needed for training. Afterwards, during testing and inference, you can take the argmax of the result over the class dimension to get (H x W x 1), with pixel values ranging from 0 to Classes-1.
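Here is a small sketch of those shapes with nn.CrossEntropyLoss, using random tensors in place of a real model and mask (the sizes are arbitrary). PyTorch is channels-first, so the model output is (N, Classes, H, W) and the target is (N, H, W) with integer class indices. The explicit one-hot version is only there to show that it gives the same number; it is not how PyTorch implements the loss internally.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, height, width = 7, 96, 72

logits = torch.randn(2, num_classes, height, width)         # model output: (N, C, H, W)
target = torch.randint(0, num_classes, (2, height, width))  # mask: (N, H, W), values 0..C-1

# Training: the loss takes per-class scores plus a mask of class indices.
loss = nn.CrossEntropyLoss()(logits, target)

# Same value computed by one-hot encoding the mask along the channel dimension.
one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
manual = -(one_hot * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# Inference: argmax over the class dimension gives one class index per pixel.
pred = logits.argmax(dim=1)
print(pred.shape)  # torch.Size([2, 96, 72])
```

If the model produced a single channel here, the target indices above 0 would have no matching class score, which is the kind of mismatch that triggers the errors described above.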
Fig 9. My outputs using the architecture described above
Fig 10. U-net Arch from the paper
This architecture consists of two paths: the downsampling path (left side) and the upsampling path (right side).
This method is much better than the method specified in the section above.
The main contribution of this paper is the U-shaped architecture: to produce better results, the high-resolution features from the downsampling path are combined (concatenated) with the equivalent upsampled output block, and a successive convolution layer can then learn to assemble a more precise output based on this information.
Another important modification to the architecture is the use of a large number of feature channels in the earlier upsampling layers, which allows the network to propagate context information to the subsequent higher-resolution upsampling layers.
Context information: information providing sufficient receptive field. In the semantic segmentation task, the receptive field is of great significance for the performance.
This strategy allows the seamless segmentation of arbitrary size images.
The downsampling path can be any typical ConvNet architecture without the classification head, e.g. the ResNet family, Xception or MobileNet. At each downsampling step, we double the number of feature channels (32, 64, 128, 256…).
Every step of the upsampling path consists of a 2x2 up-convolution that upsamples the feature map and halves the number of feature channels (256, 128, 64), a concatenation with the correspondingly cropped (optional) feature map from the downsampling path, and two 3x3 convolutions, each followed by a ReLU.
The authors of the paper specify that cropping is necessary due to the loss of border pixels in every convolution, but I believe adding reflection padding can fix that, so cropping is optional. At the final layer, the authors use a 1x1 convolution to map each 64-component feature vector to the desired number of classes; we don’t do this in the notebook you will find at the end of this article.
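One upsampling step can be sketched in PyTorch as follows. This is my own minimal version, not the paper's code or the notebook's: the class name `UpBlock` is made up, and I use `padding=1` with reflection padding on the 3x3 convolutions so the spatial size is preserved and no cropping is needed.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Up-convolution, concatenation with the skip feature map, then two 3x3 convs."""

    def __init__(self, in_channels: int, skip_channels: int, out_channels: int):
        super().__init__()
        # 2x2 up-convolution: doubles H and W, halves the number of channels.
        self.up = nn.ConvTranspose2d(in_channels, in_channels // 2, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels // 2 + skip_channels, out_channels, kernel_size=3,
                      padding=1, padding_mode="reflect"),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3,
                      padding=1, padding_mode="reflect"),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)
        # Concatenate along the channel dimension (dim=1 in PyTorch's N, C, H, W layout).
        x = torch.cat([skip, x], dim=1)
        return self.conv(x)

block = UpBlock(in_channels=256, skip_channels=128, out_channels=128)
x = torch.randn(1, 256, 16, 16)     # low-resolution features from the previous step
skip = torch.randn(1, 128, 32, 32)  # high-resolution features from the downsampling path
out = block(x, skip)
print(out.shape)  # torch.Size([1, 128, 32, 32])
```

With unpadded convolutions, as in the original paper, the skip feature map would be larger than the upsampled one and would have to be cropped before the concatenation.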
Fig 11. My outputs using a Unetish arch.
Fastai’s Dynamic U-Net is a module that builds a U-Net dynamically from any model (backbone) pretrained on ImageNet. Since it’s dynamic, it can also automatically infer the intermediate sizes and the number of in and out features.
The difference from the original U-Net is that the downsampling path is a pretrained model.
This learner is packed with most, if not all, of the image segmentation best-practice tricks for improving the quality of the output segmentation masks.
This learner is composed of:
Class DynamicUnet
Class UnetBlock
This U-Net sits on top of a backbone (which can be a pretrained model) and has a final output of n_classes. During initialization, it uses hooks to determine the intermediate feature sizes by passing a dummy input through the model, and creates the upward path automatically.
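The dummy-input trick can be illustrated with plain PyTorch forward hooks. This is a sketch of the idea, not fastai's implementation: the toy three-stage `encoder` and the helper `probe_feature_shapes` are hypothetical names of mine.

```python
import torch
import torch.nn as nn

# Toy backbone: three stages, each halving the spatial size (stand-in for a real encoder).
encoder = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU()),
)

def probe_feature_shapes(model: nn.Module, input_size=(1, 3, 64, 64)):
    """Record the output shape of every top-level stage using forward hooks."""
    shapes = []
    hooks = [
        stage.register_forward_hook(lambda mod, inp, out: shapes.append(tuple(out.shape)))
        for stage in model.children()
    ]
    was_training = model.training
    model.eval()
    with torch.no_grad():
        model(torch.zeros(input_size))  # dummy input, only the shapes matter
    model.train(was_training)           # restore the original mode
    for hook in hooks:
        hook.remove()                   # clean up so later forward passes are unaffected
    return shapes

shapes = probe_feature_shapes(encoder)
print(shapes)  # [(1, 32, 32, 32), (1, 64, 16, 16), (1, 128, 8, 8)]
```

Knowing the channel count and spatial size at each stage is enough to size the matching upsampling blocks without hard-coding anything about the backbone.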
Blur: a flag that enables blurring at each layer to avoid checkerboard artifacts.
Self_Attention: an attention mechanism is applied to selectively give more importance to some locations of the image than to others.
Bottle: determines whether we use a bottleneck for the cross-connection from the downsampling path to the upsampling path.
A quasi-U-Net block that uses PixelShuffle upsampling and ICNR weight initialisation, both of which are best-practice techniques for eliminating checkerboard artifacts in fully convolutional architectures. ICNR was introduced in the checkerboard artifact free sub-pixel convolution paper.
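To show what PixelShuffle plus ICNR does, here is a rough sketch in plain PyTorch. It is not fastai's code: `icnr_` and `PixelShuffleUp` are names I made up, and the 1x1 convolution and zeroed bias are assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn

def icnr_(weight: torch.Tensor, scale: int = 2) -> None:
    """ICNR-style init: every group of scale**2 output channels starts with the same kernel."""
    out_ch, in_ch, kh, kw = weight.shape
    sub = torch.empty(out_ch // scale**2, in_ch, kh, kw)
    nn.init.kaiming_normal_(sub)
    with torch.no_grad():
        # PixelShuffle interleaves channels c*scale**2 ... c*scale**2 + scale**2 - 1
        # into one output channel, so those channels get identical weights.
        weight.copy_(sub.repeat_interleave(scale**2, dim=0))

class PixelShuffleUp(nn.Module):
    """Conv to out_channels * scale**2 channels, then rearrange channels into space."""

    def __init__(self, in_channels: int, out_channels: int, scale: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels * scale**2, kernel_size=1)
        icnr_(self.conv.weight, scale)
        nn.init.zeros_(self.conv.bias)  # assumption: zero bias so the groups match exactly
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(x))

up = PixelShuffleUp(in_channels=64, out_channels=32)
x = torch.randn(1, 64, 8, 8)
y = up(x)
print(y.shape)  # torch.Size([1, 32, 16, 16])
```

Right after initialisation every 2x2 block of the output is constant, which is the same as nearest-neighbour upsampling, so the layer starts training without a checkerboard pattern baked in.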
It uses hooks to store the output of each block needed for the cross-connection from the backbone model.
There are 3 key takeaways:
Thank you very much for reading, you are really amazing. I do this for you.