Is Object Detection a Done Deal?

A few years back it was widely known that Object Detection was a hard problem to solve. The comic below is from just a few years back. Things have changed quite drastically in this short time.

With the advent of Deep Neural Network architectures, Convolutional Neural Networks (CNNs) in particular, the development of the CUDA library, which put the many cores of gaming/rendering GPUs to work, and open collaborative research, things have changed drastically for the better.

Not only is it now possible to recognize a bird, but which bird as well. Here is a snapshot from the Google Cloud Vision API demo: https://cloud.google.com/vision/docs/drag-and-drop

The Google Vision API is 99 percent confident that the image is of a bird, and 75 percent confident of which family of birds it belongs to! (Coraciiformes are a group of usually colorful birds including the kingfishers, the bee-eaters, the rollers, the motmots, and the todies: https://en.wikipedia.org/wiki/Coraciiformes)

Things that would have taken a 15-year research team to do are now rapidly becoming reality. But there are caveats here.

All is not so Great

I am not a researcher, but I have been using open source algorithms and frameworks for Object Detection for about two years now. I started with the ML-based HOG and HAAR detectors in OpenCV, then moved to the faster CUDA/GPU versions of those, and finally, since tuning the parameters of these systems to work across different videos was proving futile, went ahead with the neural network based methods. I wanted to write this because many who have used open source implementations like Yolo tend to think that it is a done deal. There is also heavy marketing by a lot of small and specialized companies, who follow similar thinking, promising visual automation, either customizable or customized for some vertical. (Does it remind one of the IBM Watson marketing, and the place it occupies now?)

Maybe when we humans see that a system is able to detect and/or classify some images perfectly, we tend to imagine and extend that capability to all scenarios, because we humans are great at generalizing. With CNNs we have something similar, better at generalizing features than what came before, but nowhere near great yet. Read on.

The Good

Here is a result from a photo I took a while back; I used the Google Cloud Vision API demo page to upload and check. I chose the Google API because it is the best at this, or one of the best. Wow! The results are amazing; not only has it detected that it is a Flower with 86 percent confidence (did it miss the beetle?), but it has also correctly identified the family, the Morning Glory Family! This is some information! I am really impressed. Anyone would be. It is the same feeling as IBM Watson beating the human contenders in Jeopardy, or the Go-playing AlphaGo from Google DeepMind.

When we see output like this our expectations increase exponentially. We tend to equate the system with human-like abilities for vision, combined with computer-like speed and correlation to digitized information. The perfect marriage. Imagine what a trained network can do in medical scans. Every vision related problem seems to be generalized as a possibility and then automated and augmented with information to create a system.

This is partly true, but there are gaps, large gaps, not unbridgeable, but which require work. Let us see a few.

The Bad & The U..ly

Note that in the technical architecture of a CNN there is more complexity in object detection than in object classification. Image classifiers have very high accuracy in tests compared to detectors, which also need to find the position of the object and draw a bounding box on the image. The Google Vision API is doing image classification.

A system that can classify a flower into its family would surely be able to decipher the details from the below photo?

However, not a single Car is ‘detected’ (technically this is a classifier). Let us give another clearer shot to the system.

Still, no Car detected. Let us make the car/cars slightly bigger

Now it is detecting cars with 96% confidence. Why is that?

Scale Invariance

Let us make the car a lot bigger

It not only is 99% sure that it is a car but also detects it is a Toyota Corolla with 52% confidence.

So if the car image is small, like from an aerial shot, it is not able to detect it at all, and if it is of a reasonable size it is 99 percent confident that it is a car. Are CNNs really scale invariant? Short answer: No. Max pooling in the CNN helps a bit, but not for large changes. Read on for more explanation.

Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image. (Deep Learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville)

Rotation Invariance

CNNs (or at least the current CNN networks) are not rotation invariant; they have to be trained for it (via data augmentation) to get the effect. Even in a highly trained network like the one behind the Google Vision API we can see some effects of this. Let us make a small change to the above test image of a car; I invert the car picture and try again.

It is still detected as a Car with high confidence; no problem with that. The car make, Toyota, is now wrongly detected as BMW (not visible above, but you can see it in the picture below).

I rotate it a further 90 degrees.

And then it loses the BMW confidence too, but Wheels and other auto parts are gaining in confidence.

This will be more evident with images that are less common. Here is an image of a tap rotated 90 degrees and you could see the confidence changing (Chair ?)

During training, each image is usually augmented via transformations to avoid these sorts of errors and to reduce the sensitivity of the network to such transformations.

But as you can see, when the angle changes from the training set, due to camera angle or taking random pictures from the wild, the output changes too. I have just used random images; my aim here is not to show how bad the system is and confuse the system with difficult images, but to point out certain aspects.

This may seem surprising to many. CNNs are supposed to be scale invariant, and translation and rotation invariant. Or is that just loose trade talk? Going slightly technical, let’s dig further.

There is a general conception that Pooling (Max Pooling) provides scale and translation invariance. This is both true and false. What needs to be understood is that pooling helps in ‘learning’ invariance, and for learning, the network should be trained with suitable images. Also, CNNs are invariant to translation. I guess there are a few who think this means invariant to rotation also. But translation here means shifting the position of the object left, right, up or down (shown clearly in the picture below).

source https://stats.stackexchange.com/a/208949/191675

Maxpooling helps here. This answer illustrates this lucidly. Here no data augmentation is needed. Assuming that a CNN is good in detecting a picture of a cat, it will detect a cat translated anywhere in the frame.
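To make the max pooling point concrete, here is a minimal NumPy sketch on a toy 1-D "feature map". The signal values and the pooling window of 4 are made up for illustration; they do not come from any real network. It shows the pooled output staying the same for a one-position shift, and changing once the shift pushes a response across a pooling window boundary.

```python
import numpy as np

def max_pool_1d(x, size):
    """Non-overlapping max pooling over a 1-D signal (length must divide by size)."""
    return x.reshape(-1, size).max(axis=1)

# Toy feature map with two strong detector responses (hypothetical values).
signal = np.array([0, 0, 9, 0, 0, 0, 0, 0, 0, 5, 0, 0], dtype=float)

pooled = max_pool_1d(signal, size=4)

# Translate the whole signal by one position: the pooled output is unchanged.
pooled_shift_1 = max_pool_1d(np.roll(signal, 1), size=4)

# Translate by two positions: the 9 crosses into the next window and the
# pooled output changes, so the invariance only holds for small shifts.
pooled_shift_2 = max_pool_1d(np.roll(signal, 2), size=4)

print(pooled, pooled_shift_1, pooled_shift_2)
```

So pooling on its own buys invariance only within the pooling window; anything beyond that has to come from the rest of the network and from the training data.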

Here is a passage from a very reputed source, the Deep Learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville. Along with Geoffrey Hinton and Yann LeCun, Bengio is considered one of the three people most responsible for the advancement of deep learning during the 1990s and 2000s.

In all cases, pooling helps to make the representation approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change — Deep Learning book , http://www.deeplearningbook.org/

But regarding scale invariance and rotation invariance; here is from the same book

Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image.

And there are other papers that have tested current networks and reported the same. Here is a quote from a Dec 2017 paper (2)

“We obtain the surprising result that architectural choices such as the number of pooling layers and the convolution filter size have only a secondary effect on the translation invariance of a network. Our analysis identifies training data augmentation as the most important factor in obtaining translation-invariant representations of images using convolutional neural networks.” From “Quantifying Translation-Invariance in Convolutional Neural Networks” (Eric Kauderer-Abrams, Stanford University)

And from another recent paper May 2018

Deep convolutional neural networks (CNNs) have revolutionized computer vision. Perhaps the most dramatic success is in the area of object recognition, where performance is now described as “superhuman” [20]. …

Despite the excellent performance of CNNs on object recognition, the vulnerability to adversarial attacks suggests that superficial changes can result in highly non-human shifts in prediction …

Obviously, not any data augmentation is sufficient for the networks to learn invariances. To understand the failure of data augmentation, it is again instructive to consider the subsampling factor. Since in modern networks the subsampling factor is approximately 45, then for a system to learn complete invariance to translation only, it would need to see 45² = 2025 augmented versions of each training example. If we also add invariance to rotations and scalings, the number grows exponentially with the number of irrelevant transformations

From Why do deep convolutional networks generalize so poorly to small image transformations? Yair Weiss, Aharon Azulay ELSC Hebrew University of Jerusalem https://arxiv.org/pdf/1805.12177.pdf
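The arithmetic in the quote above is easy to reproduce. In the sketch below, the subsampling factor of 45 is from the paper; the function name, and the idea of sampling each extra transformation (rotation, scale) at a fixed number of values, are my own assumptions to illustrate the "grows exponentially" part.

```python
def augmented_versions_needed(subsampling_factor,
                              n_other_transforms=0,
                              samples_per_transform=1):
    """Rough count of augmented copies per training image.

    Translation alone needs one copy per (x, y) offset inside the
    subsampling grid; every further transformation multiplies the count
    by the number of values sampled for it (an illustrative assumption).
    """
    translations = subsampling_factor ** 2
    return translations * samples_per_transform ** n_other_transforms

translation_only = augmented_versions_needed(45)

# Hypothetical: also cover rotation and scale, 10 sampled values each.
with_rotation_and_scale = augmented_versions_needed(
    45, n_other_transforms=2, samples_per_transform=10)

print(translation_only, with_rotation_and_scale)
```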

If that is the case, how has the Google API been able to recognize the inverted and rotated car in the tests shown earlier? (Notice that it scored the car pretty high, and only missed the other details like the brand, on which it may not have been trained as strongly.)

Data Augmentation is the Key

The key is data augmentation. Basically, the input image is used for training along with additional images generated from it by rotation, scaling, added noise and so on. A good explanation is here: https://medium.com/ymedialabs-innovation/data-augmentation-techniques-in-cnn-using-tensorflow-371ae43d5be9.

CNNs are scale invariant to some level, if they are trained to be, as the pooling layers will then be able to handle that. Rotational invariance also has to be trained in.
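As a rough idea of what such augmentation looks like in code, here is a minimal NumPy-only sketch. It is not the pipeline from the linked article or from any framework; the function name, the choice of transformations and the noise level are assumptions, and a real pipeline would use arbitrary angles and proper resampling.

```python
import numpy as np

def augment(image, rng):
    """Return simple augmented variants of an H x W x C uint8 image."""
    variants = [image]
    # Rotations by 90, 180 and 270 degrees (rotates the first two axes).
    variants += [np.rot90(image, k) for k in (1, 2, 3)]
    # Horizontal flip.
    variants.append(np.fliplr(image))
    # Crude 2x downscale by keeping every second pixel.
    variants.append(image[::2, ::2])
    # Additive Gaussian noise, clipped back to the valid pixel range.
    noise = rng.normal(0.0, 10.0, size=image.shape)
    noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    variants.append(noisy)
    return variants

rng = np.random.default_rng(0)
# A random stand-in for a real photo, 64 pixels high and 48 wide.
image = rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8)
augmented = augment(image, rng)
print([v.shape for v in augmented])
```

For detection (as opposed to classification) the bounding box annotations have to be transformed together with the image, which this sketch leaves out.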

‘Learning’ Invariance to Rotation via Pooling

Let us look at rotational invariance first, and how a CNN can be trained for it, as it is a bit easier. Here is the illustration from the Deep Learning book.

source pg 338 http://www.deeplearningbook.org/contents/convnets.html

Example of learned invariances. A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a hand written 5. Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in the input, the corresponding filter will match it and cause a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated…

pg 338 http://www.deeplearningbook.org/contents/convnets.html
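The idea in the book's figure can be imitated with a toy example. This is only a sketch under my own assumptions: instead of learned filters for a handwritten 5, I use a hand-made 3x3 "L" pattern and three fixed filters that are copies of it at 0, 90 and 180 degrees; a max over the three detector responses plays the role of the pooling unit.

```python
import numpy as np

# Hand-made stand-in for a learned feature (hypothetical "L" shape).
pattern = np.array([[1, 0, 0],
                    [1, 0, 0],
                    [1, 1, 1]], dtype=float)

# Three "learned" filters, each matching one orientation of the pattern.
filters = [np.rot90(pattern, k) for k in range(3)]

def detector_responses(x):
    """Response of each filter: element-wise product summed over the patch."""
    return np.array([np.sum(f * x) for f in filters])

def pooled_response(x):
    """Max pooling across the detector units, not across space."""
    return detector_responses(x).max()

for k in range(4):
    rotated_input = np.rot90(pattern, k)
    print(k * 90, detector_responses(rotated_input), pooled_response(rotated_input))
```

The pooled response is at its maximum for the three orientations that have a matching filter and drops for the 270 degree input, which no filter was "trained" on. That is the sense in which the invariance is learned and not built in.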

Basically we need to either augment the training images by rotating them, or get a pool of images taken at different angles, and use them for training the CNN. (There is also an alternative method, where a learnable transformation module is added to the CNN; it takes the input image and learns to apply transformations to it to improve detection. See Spatial Transformer Networks from Google DeepMind; I have not used this myself.)

Drill Down: The problem of CNN invariance to Scale

This is a little more complex. For real-time detection, we use a CNN called a Single Shot Detector. Single shot detectors sacrifice some accuracy for performance.

Here is one picture you may have seen from the YOLO home page.

from https://pjreddie.com/darknet/yolo/

Multi-object detection. Note: detection is different from, and more difficult than, classification in that it also needs to predict the bounding boxes in which the objects are present.

Here is the output of a previous version of Yolo (Yolo v2; the current v3 seems to have improved a lot) on a picture taken from a height.

Yolo v2 on an aerial picture: image resolution problem

And if you think these types of pictures or use cases are rare, many real world use cases are very similar to this. This is one problem in using Object Detection for real world products: it is very hard to test. You may have a highly trained person detector, but are you sure you have enough images of all the skin tones, facial features and attires for it to work very well across the world?

Many of the things that work very well in a demo are sometimes useless in production or for a particular customer's use case; this is one reason that prompted me to write this.

As I said, there can be two reasons why the network does not detect small objects even though it is trained well for large ones.

Problem 1: Limit of Input resolution

Yolo v2 scales images down; the input image here was a frame from an HD video feed. Scaling it from the input size (1280*720) down to (416*416) immediately destroys a lot of features, especially those of small objects. This is the first problem. Lesson learned: use a network implementation that will take higher resolution images, and have a decent GPU with enough memory (a GTX 1080 should do for a start). Note that each Convolution and Max Pooling layer further reduces the features. Max pooling is essential for positional invariance, but its final effect on detecting small objects is very bad.

If we cut the above frame into 4 frames, give them to Yolo v2 individually and then stitch the results together, it performs well (a good solution at that time by one of my teammates, Sai Narasimha Vennamaneni). There is a cost involved here: one of speed, and then the complexity overhead of removing overlapping bounding boxes. A straight slicing may cut objects at the tile boundaries, so the tiles have to overlap, and possible duplicate detections in the overlap then have to be ignored.
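A quick back-of-the-envelope calculation shows how little is left of a small object after this resize. The 1280*720 and 416*416 sizes are from the text; the 40x20 pixel car and the plain stretch resize (no letterboxing) are assumptions for illustration.

```python
def scaled_object_size(obj_w, obj_h, src=(1280, 720), dst=(416, 416)):
    """Size in pixels of an object after the frame is stretched from src to dst."""
    scale_x = dst[0] / src[0]
    scale_y = dst[1] / src[1]
    return obj_w * scale_x, obj_h * scale_y

# Hypothetical small car seen from above: 40 x 20 pixels in the HD frame.
car_w, car_h = scaled_object_size(40, 20)
pixels_before = 40 * 20
pixels_after = car_w * car_h

print(f"{car_w:.1f} x {car_h:.1f} px, "
      f"{pixels_after / pixels_before:.0%} of the original pixels remain")
```

The car ends up at roughly 13 x 12 pixels, with under a fifth of its original pixels, before a single convolution layer has run.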
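The tiling idea can be sketched roughly as below. This is not our production code, only a minimal illustration: the function names, the 20 percent overlap and the 0.5 IoU threshold are assumptions, and the detector call itself is left out (the per-tile detections are given by hand). Duplicates in the overlap are dropped with a simple greedy non-maximum suppression.

```python
def make_tiles(width, height, rows=2, cols=2, overlap=0.2):
    """Return (x0, y0, x1, y1) tiles covering the frame, widened so they overlap."""
    tile_w, tile_h = width / cols, height / rows
    pad_x, pad_y = tile_w * overlap / 2, tile_h * overlap / 2
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x0 = max(0, int(c * tile_w - pad_x))
            y0 = max(0, int(r * tile_h - pad_y))
            x1 = min(width, int((c + 1) * tile_w + pad_x))
            y1 = min(height, int((r + 1) * tile_h + pad_y))
            tiles.append((x0, y0, x1, y1))
    return tiles

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def merge_detections(per_tile, iou_threshold=0.5):
    """Shift tile-local boxes to frame coordinates and drop duplicates.

    per_tile: list of (tile, [(box_in_tile_coords, score), ...]).
    """
    shifted = []
    for (x0, y0, _, _), detections in per_tile:
        for (bx0, by0, bx1, by1), score in detections:
            shifted.append(((bx0 + x0, by0 + y0, bx1 + x0, by1 + y0), score))
    shifted.sort(key=lambda d: d[1], reverse=True)  # best score first
    kept = []
    for box, score in shifted:
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, score))
    return kept

tiles = make_tiles(1280, 720)
# Hand-written detections: the same car seen in the overlap of tiles 0 and 1.
per_tile = [
    (tiles[0], [((600, 100, 660, 140), 0.9)]),
    (tiles[1], [((24, 100, 84, 140), 0.8)]),
]
merged = merge_detections(per_tile)
print(tiles)
print(merged)
```

A real system would run non-maximum suppression per class, and would choose the overlap to be at least as large as the biggest object it expects to find at a tile boundary.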

obfuscated image for demo

Problem 2: CNN layers remove features; not good news for small object detection with deep neural networks.

This is a bigger problem. Each convolution layer basically looks for some patterns while losing some details, so at some depth the features of these small cars vanish completely.
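One way to get a feel for this is to follow the footprint of a small object through the downsampling stages. The sketch below assumes a 416 pixel input and five stride-2 stages (a total stride of 32, which gives the 13x13 output grid of Yolo v2); the 13 pixel object is a made-up example, and the function name is my own.

```python
def footprint_per_stage(object_px, input_px=416, n_stages=5):
    """List (stage, feature_map_size, object_footprint_in_cells) per 2x downsampling."""
    rows = []
    for stage in range(1, n_stages + 1):
        stride = 2 ** stage
        rows.append((stage, input_px // stride, object_px / stride))
    return rows

# Hypothetical object that is 13 pixels wide in the 416 x 416 network input.
rows = footprint_per_stage(13)
for stage, fmap, cells in rows:
    print(f"stage {stage}: {fmap:3d} x {fmap:<3d} map, object spans {cells:.2f} cells")
```

By the last stage the object covers less than half of one cell, so there is very little spatial evidence left for the detector to work with.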

SSD uses layers already deep down into the convolutional network to detect objects. If we redraw the diagram closer to scale, we should realize the spatial resolution has dropped significantly and may already miss the opportunity in locating small objects that are too hard to detect in low resolution. If such problem exists, we need to increase the resolution of the input image.

from https://medium.com/@jonathan_hui/what-do-we-learn-from-single-shot-object-detectors-ssd-yolo-fpn-focal-loss-3888677c5f4d

Here is a slightly more technical explanation from a recently published paper

Since feature maps of layers closer to the input are of higher resolution and often contain complementary information (wrt. conv5), these features are either combined with shallower layers (like conv4, conv3) [23, 31, 1, 31] or independent predictions are made at layers of different resolutions [36, 27, 3]. Methods like SDP [36], SSH [29] or MS-CNN [3], which make independent predictions at different layers, also ensure that smaller objects are trained on higher resolution layers (like conv3) while larger objects are trained on lower resolution layers (like conv5).

An Analysis of Scale Invariance in Object Detection - SNIP, Bharat Singh and Larry S. Davis, University of Maryland, College Park

http://openaccess.thecvf.com/content_cvpr_2018/papers/Singh_An_Analysis_of_CVPR_2018_paper.pdf

Jonathan Hui writes excellent blogs; he explains there how Yolo v3 overcomes this problem with a Feature Pyramid, so this may not be too much of a problem now. Other networks like RetinaNet also perform well for small objects. But one needs to test and check.

Here is from another paper April 2018

We provide an illustration of the motivation of the paper …. Pedestrian instances in the automotive images (e.g., Caltech dataset [11]) often have very small sizes….. Accurately localizing these small-size pedestrian instances is quite challenging due to the following difficulties. Firstly, most of the small-size instances appear with blurred boundaries and obscure appearance. It is difficult to distinguish them from the background clutters and other overlapped instances. Secondly, the large-size pedestrian instances typically exhibit dramatically different visual characteristics from the small-size ones

source https://ieeexplore.ieee.org/abstract/document/8060595

For instance, body skeletons of the large-size instances can provide rich information for pedestrian detection while skeletons of the small-size instances cannot be recognized so easily. Such differences can also be verified by comparing the generated feature maps for large-size and small-size pedestrians, as shown in Fig. 1.

From Scale-Aware Fast R-CNN for Pedestrian Detection By Jianan Li ; Xiaodan Liang ; Shengmei Shen ; Tingfa Xu ; Jiashi Feng ; Shuicheng Yan

Scale Invariance- Training it in

From my experience, CNNs currently are not scale invariant. It may be due to the above two factors: feature loss when the target object is small, compounded with the feature loss in a deep neural network. However, we have found that if we are able to prepare a training data set that has both small and large objects, the current network is able to detect the same class at different scales, as long as it can work on input images without scaling them down much. So training the network has become sort of a skill now.

The Elephant in the room; Need for a large good quality human annotated image set for Training

Here is the most painful thing about CNNs today: you need thousands to literally hundreds of thousands of well-annotated images of an object for training, so that the network generalizes well enough without over-fitting.

The COCO image set is to object detection what ImageNet is to image classification.

However there is a high chance that the object you want to detect is not one of the 80 classes of images in COCO.

Why is this so important ? For this, we need to understand a bit about generalization.

The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs — not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization Chapter 5.2 Deep Learning book

When a neural net trains, it uses the divergence of its predictions from the labels of the training data to learn the correct weights via back propagation. If there are only a few images to train on, the network will fit the training data too well (or be too specific to it), and will perform worse on data in the wild. To reduce this, regularization techniques are used. Instead of just train and test sets, there is also a third set of images called the validation set; if the results start to diverge too much on the validation set, even though they match the training set more and more, then it is an indication to do an ‘early stop’ of the training.
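The early stop logic itself is simple. Here is a minimal sketch in plain Python; the function name, the patience of 3 epochs and the loss values are made up for illustration, and frameworks ship their own versions of this (usually as a training callback).

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has not improved for
    `patience` epochs in a row; stop_epoch is None if that never happens.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return None, best_epoch

# Hypothetical run: validation loss improves, then starts climbing again
# while the training loss (not shown) would keep falling.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses)
print(f"stop at epoch {stop_epoch}, keep the weights from epoch {best_epoch}")
```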

The other regularization option is drop-out.

Simply put, dropout refers to ignoring units (i.e. neurons) during the training phase of certain set of neurons which is chosen at random. https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5
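In code, dropout on one layer's activations is only a few lines. The sketch below uses the common "inverted dropout" variant, in which the surviving units are scaled up during training so that nothing needs to change at inference time; this variant, the function name and the drop probability of 0.5 are my choices for illustration and are not taken from the quoted article.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero random units while training, rescale the rest."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones(1000)

train_out = dropout(x, drop_prob=0.5, rng=rng)                 # about half zeroed
eval_out = dropout(x, drop_prob=0.5, rng=rng, training=False)  # unchanged

print((train_out == 0).mean(), train_out.max(), eval_out.mean())
```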

A lot of questions abound on the internet regarding how to prevent over-fitting https://github.com/keras-team/keras/issues/4325. Also see this excellent article with sample — https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/

Apart from the methods above, the main part is to have enough data points to train with.

To prevent overfitting, the best solution is to use more training data. A model trained on more data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization https://www.tensorflow.org/tutorials/keras/overfit_and_underfit

Since we are on the topic of generalization (and the role of regularisation techniques in preventing over-fitting), this may also be the place to mention the popular paper Understanding deep learning requires rethinking generalization by Chiyuan Zhang et al. It shows that DNNs, especially CNNs, can ‘memorize’ instead of generalizing (see this quora answer). This basically means that if you train with, say, faces, and the set contains some images of random noise instead of proper faces, the network will still converge, with training accuracy nearing 100 percent. So there are a lot of parts of this ‘learning’ process which have popular explanations, but which are not fully understood. I wanted to put this in, even though it is slightly confusing, because sometimes you may find your trained network not detecting a particular image, and then you need to hunt for theories to explain it better and try some solution. Object classification itself is not a done deal yet.

A word about Transfer Learning

In the recent Google NEXT event, AutoML was presented. It was claimed that, using AutoML for Vision, ten to twenty images of leaves are all that is needed for training. I am not sure of the internals of AutoML, but my inference is that it could be based on transfer learning (practically, for a network like RetinaNet described here).

Here is the same sentiment from another source

The origin of the 1,000-image magic number comes from the original ImageNet classification challenge, where the dataset had 1,000 categories, each with a bit less than 1,000 images for each class (…. This was good enough to train the early generations of image classifiers like AlexNet, and so proves that around 1,000 images is enough.

Can you get away with less though? Anecdotally, based on my experience, you can in some cases but once you get into the low hundreds it seems to get trickier to train a model from scratch. The biggest exception is when you’re using transfer learning on an already-trained model. Because you’re using a network that has already seen a lot of images and learned to distinguish between the classes, you can usually teach it new classes in the same domain with as few as ten or twenty examples. From https://petewarden.com/2017/12/14/how-many-images-do-you-need-to-train-a-neural-network/

But if we have to detect an object class that is not in the same domain as the images the network was trained on, this transfer learning will not work. To give a simple example: it is definitely possible to train a system to detect nails, say, from a few images; but then it will see everything as nails, literally. Basically, since CNNs are very deep neural networks, they need a lot of data (read images) to generalize. This calls for a lot of work in collecting the required images, annotating them, and then training the network in such a way, and for such a time, as to get the optimal result, preventing underfitting or overfitting.

The Future

If you can see a glimpse of light, you can already start imagining the sky. I guess we will be out of this tunnel very soon, going by the research: Geoffrey Hinton (one of the ‘Godfathers’ of the recent AI resurgence, who co-introduced back propagation) has spoken about what is wrong with current CNNs and how his idea of the CapsuleNet architecture would be better.