EVOLUTION OF IMAGE RECOGNITION AND OBJECT DETECTION: FROM APES TO MACHINES

Written by aamirnazir299 | Published 2018/10/18
Tech Story Tags: machine-learning | deep-learning | image-processing | object-detection | apes-to-machines


For a long time, we have dreamed of harnessing the amazing gift of vision, because with it we could reach new heights and open up endless possibilities, like cars that can drive themselves. Along the path of harnessing this power, we have developed numerous algorithms, all the way from simple edge detection to pixel-level object detection.

One way to combine image recognition with deep learning is to use a simple neural network that takes in an image as input, performs operations on it in the hidden layers, and produces a vector in the output layer, where each node represents a class and the value it holds is the probability that the image belongs to that class.
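As a quick illustration, here is a minimal sketch of such a network in PyTorch; the layer sizes and the 28 × 28 input are illustrative choices, not anything prescribed by the approach described above.

```python
import torch
import torch.nn as nn

# A minimal fully connected image classifier: flatten the image,
# run it through one hidden layer, and emit one score per class.
# Layer sizes are illustrative; softmax turns scores into probabilities.
class SimpleImageClassifier(nn.Module):
    def __init__(self, height=28, width=28, channels=1, hidden=128, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                  # image -> 1D vector
            nn.Linear(height * width * channels, hidden),  # hidden layer
            nn.ReLU(),
            nn.Linear(hidden, num_classes),                # one node per class
        )

    def forward(self, x):
        logits = self.net(x)
        return torch.softmax(logits, dim=1)  # per-class probabilities

model = SimpleImageClassifier()
probs = model(torch.randn(1, 1, 28, 28))  # e.g. one 28x28 grayscale image
print(probs.shape)  # torch.Size([1, 10]); each entry is a class probability
```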

This works out fine for small images, but it breaks down for bigger ones: an HD 1080p image, for instance, has 1920 × 1080 = 2,073,600 pixels, which with three colour channels means over six million input neurons in the first layer alone!
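To see why this blows up, it helps to count the weights a single fully connected layer would need for such an input (a back-of-the-envelope sketch; the hidden-layer size of 1,000 is an arbitrary assumption):

```python
# Back-of-the-envelope: weights in the first fully connected layer
# for a full-resolution 1080p RGB input.
inputs = 1920 * 1080 * 3       # 6,220,800 input values
hidden = 1000                  # a modest hidden layer (assumed size)
weights = inputs * hidden
print(f"{weights:,} weights")  # 6,220,800,000 weights in one layer
```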

CNN (Convolutional Neural Network)

Ever since Alex Krizhevsky, Geoff Hinton, and Ilya Sutskever won ImageNet in 2012, Convolutional Neural Networks (CNNs) have become the gold standard for image classification. In fact, since then, CNNs have improved to the point where they now outperform humans on the ImageNet challenge!

In classification, there’s generally an image with a single object as the focus and the task is to say what that image is. But when we look at the world around us, we carry out far more complex tasks.

We separately recognize each of the multiple objects present in our environment, and we can even tell apart objects that overlap one another in a scene!

This type of problem, at a high level, is called object detection, in which we detect, classify, and locate (draw bounding boxes around) the objects in an image.

R-CNN

R-CNN was an early architecture, proposed in 2014, to solve this task of object detection.

Source: https://arxiv.org/abs/1311.2524.

The team, composed of Ross Girshick, Jeff Donahue, and Trevor Darrell, found that the problem could be solved by building on Krizhevsky's results, testing on the PASCAL VOC Challenge, a popular object detection challenge akin to ImageNet.

Let us now understand the architecture of R-CNN.

ARCHITECTURE OF R-CNN

The goal of R-CNN is to take in an image and correctly identify where the main objects in the image are.

But how do we do it? R-CNN does what we might intuitively do as well: propose a bunch of boxes in the image and see if any of them actually correspond to an object.

R-CNN creates these bounding boxes, or region proposals, using a process called Selective Search. At a high level, Selective Search looks at the image through windows of different sizes, and for each size tries to group together adjacent pixels by texture, colour, or intensity to identify objects.

Source: https://arxiv.org/abs/1311.2524.
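If you want to see Selective Search in action, OpenCV ships an implementation (in the opencv-contrib package); a minimal sketch, assuming an image file at a placeholder path:

```python
import cv2  # requires the opencv-contrib-python package for ximgproc

# Generate region proposals with Selective Search, the same algorithm
# R-CNN uses; "image.jpg" is a placeholder path.
img = cv2.imread("image.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # trades some recall for speed
rects = ss.process()              # array of (x, y, w, h) proposals
print(f"{len(rects)} region proposals")  # typically on the order of 2,000 boxes
```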

Once the proposals are created, R-CNN warps each region to a standard square size and passes it through a modified version of AlexNet, as shown above. On the final layer of the CNN, R-CNN adds a Support Vector Machine (SVM) that simply classifies whether the region contains an object and, if so, which object.

The final step of R-CNN runs a simple linear regression on the region proposal to generate tighter bounding box coordinates, giving us our final result.
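Putting the three stages together, a single R-CNN-style step over one proposal looks roughly like the sketch below. This is an illustration only: the AlexNet backbone is real, but the SVM and box-regressor weights are random stand-ins for trained models, and the proposal box is made up.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# R-CNN-style pipeline sketch: warp a proposal to a fixed size,
# extract CNN features, then classify and refine the box.
backbone = models.alexnet(weights=None)         # untrained here; R-CNN fine-tuned a pre-trained AlexNet
backbone.classifier = backbone.classifier[:-1]  # drop the 1000-way layer -> 4096-d features

num_classes = 21                             # e.g. PASCAL VOC: 20 classes + background
svm_w = np.random.randn(num_classes, 4096)   # per-class linear SVM (random stand-in)
reg_w = np.random.randn(4, 4096)             # bounding-box regressor (random stand-in)

def rcnn_step(image, box):
    """image: CHW float tensor; box: (x, y, w, h) region proposal."""
    x, y, w, h = box
    crop = image[:, y:y + h, x:x + w]
    warped = TF.resize(crop, [227, 227])     # warp to AlexNet's input size
    with torch.no_grad():
        feats = backbone(warped.unsqueeze(0))[0].numpy()  # 4096-d feature vector
    cls = int(np.argmax(svm_w @ feats))      # SVM scores -> predicted class
    deltas = reg_w @ feats                   # offsets to tighten the box
    return cls, deltas

image = torch.rand(3, 480, 640)
print(rcnn_step(image, (50, 60, 200, 150)))  # hypothetical proposal
```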

THE PROBLEM

R-CNN works really well but is quite slow, for a few simple reasons:

It requires a forward pass of the CNN for every single region proposal, which is around 2,000 forward passes per image.

It has to train three different models separately — the CNN, the classifier that predicts the class, and the regression model to tighten the bounding boxes.

FAST R-CNN

In 2015, Ross Girshick, the first author of R-CNN, solved both of the above-mentioned problems, leading to a second algorithm, Fast R-CNN, which was simpler and faster than its predecessor.

1. RoI (Region of Interest) POOLING

For the forward pass of the CNN, Girshick realized that for each image, many of the proposed regions overlapped, causing the same CNN computation to run again and again (~2,000 times per image!).

Fast R-CNN solves this using a technique known as RoIPool (Region of Interest Pooling). At its core, RoIPool shares the forward pass of a CNN for an image across its subregions. Features for each region are obtained by selecting the corresponding region from the CNN's feature map, and the features in each region are then pooled (usually with max pooling). So all it takes is a single pass over the original image instead of thousands!
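torchvision ships an RoI pooling op, which makes the idea easy to see: one shared feature map, many regions, one fixed-size output per region. The feature-map size, boxes, and 1/16 scale below are illustrative values:

```python
import torch
from torchvision.ops import roi_pool

# One CNN forward pass produces a shared feature map; RoI pooling then
# cuts out and max-pools a fixed-size grid of features for each proposal.
feature_map = torch.randn(1, 256, 50, 50)   # from a single forward pass
# Proposals in (x1, y1, x2, y2) image coordinates; values are illustrative.
proposals = torch.tensor([[ 40.,  40., 240., 200.],
                          [100.,  20., 300., 350.]])
# spatial_scale maps image coords to feature-map coords (e.g. 1/16 for VGG-style nets).
pooled = roi_pool(feature_map, [proposals], output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]): one fixed-size tensor per region
```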

2. Combine All Models into One Network

Source: https://www.slideshare.net/simplyinsimple/detection-52781995.

Earlier, in R-CNN, we had three different models: the CNN, the SVM, and the regressor. Fast R-CNN instead used a single network to compute all three, which made the model much faster.
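A minimal sketch of what "one network, three jobs" looks like: a shared trunk over the pooled RoI features feeding a classification head and a box-regression head, trained together. Layer sizes and the 21-class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Fast R-CNN-style joint head: one shared trunk over the pooled RoI
# features, with a classification head and a box-regression head.
class FastRCNNHead(nn.Module):
    def __init__(self, in_features=256 * 7 * 7, num_classes=21):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Flatten(), nn.Linear(in_features, 1024), nn.ReLU()
        )
        self.cls_head = nn.Linear(1024, num_classes)       # class scores
        self.bbox_head = nn.Linear(1024, num_classes * 4)  # per-class box offsets

    def forward(self, pooled_rois):
        h = self.trunk(pooled_rois)
        return self.cls_head(h), self.bbox_head(h)

head = FastRCNNHead()
scores, deltas = head(torch.randn(2, 256, 7, 7))  # e.g. two pooled RoIs
print(scores.shape, deltas.shape)  # [2, 21] class scores, [2, 84] box offsets
```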

THE PROBLEM

Even with all these advancements, one problem remained in the Fast R-CNN model: the region proposer. In Fast R-CNN, the regions of interest were still generated by Selective Search, a fairly slow process that turned out to be the bottleneck of the overall pipeline.

FASTER R-CNN

In mid-2015, a team at Microsoft Research composed of Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun found a way to make the region proposal step almost cost-free through an architecture they named Faster R-CNN.

The insight of Faster R-CNN was that region proposals depended on features of the image that were already calculated with the forward pass of the CNN (first step of classification).

Source: https://arxiv.org/abs/1506.01497.

As shown above, only a single CNN is used to carry out both region proposals and classification. This way, only one CNN needs to be trained!

REGION GENERATION

Faster R-CNN adds a Fully Convolutional Network on top of the features of the CNN creating what’s known as the Region Proposal Network.

Source: https://arxiv.org/abs/1506.01497.

The Region Proposal Network works by passing a sliding window over the CNN feature map and, at each window position, outputting k potential bounding boxes along with a score for how good each of those boxes is expected to be. Intuitively, we want rectangular boxes that resemble the shapes of real objects, for example the tall, narrow boxes that suit humans, rather than very thin slivers, so we define k such common aspect ratios, called anchor boxes. For each anchor box, we output one bounding box and score per position in the image.
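A small sketch of how the k anchors at one sliding-window position can be generated; the 3 scales × 3 ratios default (so k = 9) follows the Faster R-CNN paper, while the helper name and centre coordinates here are made up:

```python
import numpy as np

# Build the k anchor boxes for one sliding-window position.
# A tall 2:1 box is the kind of shape that suits people.
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:         # r = height / width; area stays ~s*s
            w = s * np.sqrt(1 / r)
            h = s * np.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)     # (x1, y1, x2, y2) per anchor

print(make_anchors(300, 300).shape)  # (9, 4): k = 9 boxes at this position
```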

We then pass each bounding box that is likely to contain an object into Fast R-CNN to generate a classification and a tightened bounding box.

MASK R-CNN

So far we have used only bounding boxes, but what if we could go deeper, to pixel-level object detection? Instead of drawing boxes, we would decide for each pixel of the image whether it belongs to an object.

Source: Mask R-CNN paper.

Much like Fast R-CNN and Faster R-CNN, Mask R-CNN's underlying intuition is straightforward: since Faster R-CNN already finds the region and position of each object, extend it to mark the exact pixels of the object within that region.

Architecture of Mask R-CNN. Source: https://arxiv.org/abs/1703.06870.

Mask R-CNN does pixel-level object detection by adding a branch to Faster R-CNN that outputs a binary mask saying whether or not a given pixel is part of an object. The branch (in white in the above image) is just a Fully Convolutional Network on top of a CNN-based feature map.
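A rough sketch of such a mask branch; the channel counts and the 14 × 14 RoI feature size are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Sketch of a Mask R-CNN-style mask branch: a small fully convolutional
# head over each RoI's features that predicts a per-pixel, per-class mask.
class MaskHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=21):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),  # upsample 14 -> 28
            nn.Conv2d(256, num_classes, 1),  # one mask logit map per class
        )

    def forward(self, roi_features):
        return torch.sigmoid(self.head(roi_features))  # per-pixel object probability

mask_head = MaskHead()
masks = mask_head(torch.randn(2, 256, 14, 14))  # RoI features for two regions
print(masks.shape)  # torch.Size([2, 21, 28, 28]): a 28x28 mask per class
```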

RoiAlign

The Mask R-CNN authors had to make one small adjustment to make this work as expected.

In RoIPool, we round fractional coordinates down, so for instance 2.9 pixels becomes 2 pixels, causing a slight misalignment. RoIAlign avoids such rounding: instead, it uses bilinear interpolation to get a precise idea of what the feature map holds at pixel 2.9. This, at a high level, is what allows us to avoid the misalignments caused by RoIPool.
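The difference is easy to demonstrate with torchvision, which implements both ops: RoIPool quantizes the fractional box below, while RoIAlign samples it with bilinear interpolation (the feature map and box coordinates are illustrative):

```python
import torch
from torchvision.ops import roi_align, roi_pool

# RoIPool snaps box coordinates to integers (e.g. 2.9 -> 2) before pooling;
# RoIAlign keeps fractional coordinates and reads values off the feature
# map with bilinear interpolation, avoiding the misalignment.
feature_map = torch.randn(1, 256, 50, 50)
box = torch.tensor([[2.9, 2.9, 20.7, 20.7]])  # deliberately non-integer coords

pooled  = roi_pool(feature_map, [box], output_size=(7, 7), spatial_scale=1.0)
aligned = roi_align(feature_map, [box], output_size=(7, 7), spatial_scale=1.0,
                    sampling_ratio=2)  # 2x2 bilinear samples per output cell
print(pooled.shape, aligned.shape)  # both torch.Size([1, 256, 7, 7])
```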

Once these masks are generated, Mask R-CNN combines them with the classifications and bounding boxes from Faster R-CNN to generate wonderfully precise segmentations.

