Usually, image recognition is done using computer vision and machine learning. A neural network is trained on a set of images, from which it learns to recognize certain objects.
As good as neural networks are, they are not always the best choice for the job.
Certain restrictions, such as the inability to retrain the model when new object classes are added, or weak hardware, can make the conventional neural-network approach impractical.
In this article, I will describe a method for image recognition that doesn’t involve neural networks and share my experience building a mobile app based on this approach.
The reason for ditching neural networks and looking for a different way of recognizing objects was the project’s restrictions.
In my case, the app was created for an art museum with the intent of augmenting the visitors’ experience: guests would point their smartphone camera at a painting and receive information about the piece and the artist.
One of the main requirements of this project was the ability to add new art pieces to the dataset without retraining the model, along with a recognition time under 1 second. Given these requirements, we had to forgo neural networks and go for a classic algorithm instead.
Classic algorithms have certain advantages over neural networks: they require no training, new object classes can be added to the dataset at any time, and they run fast enough even on modest hardware.
Object recognition is done by searching for key points. Key points are spatial locations that mark whatever stands out in an image. Mathematically speaking, a key point is a point of high contrast, that is, a point with a high gradient value.
For example, the key points of a chessboard would be the points where black squares meet white ones. A completely white image has no key points, since there’s no change in color anywhere in it; add another color, and key points appear at the transition between the white background and the new color.
Back to the app in question: each painting is analyzed by searching for key points, whose coordinates are determined by looking for the points of maximum contrast.
OpenCV has a KAZE class — a key point detector and descriptor. Using KAZE, we can search for key points in an image and generate a feature vector for each point.
The algorithm analyzes the area around each key point in all directions and generates a numeric feature vector that describes that point. Based on these vectors, we can compare key points by measuring the distance between them.
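Here is a minimal sketch of this step in Python with OpenCV. The image path is a placeholder; only the KAZE detector itself comes from the actual library.

```python
import cv2
import numpy as np

# Detect key points and compute their descriptors with KAZE.
img = cv2.imread("painting.jpg", cv2.IMREAD_GRAYSCALE)
kaze = cv2.KAZE_create()
keypoints, descriptors = kaze.detectAndCompute(img, None)

# Each descriptor is a 64-dimensional float vector (KAZE default).
# The similarity of two key points is the Euclidean distance between
# their descriptors: the smaller the distance, the more alike they are.
distance = np.linalg.norm(descriptors[0] - descriptors[1])
```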
Here’s how the process of image recognition works when using the key point approach: detect the key points of the query image, compute a descriptor for each of them, match those descriptors against the descriptors stored for every object in the dataset, and pick the object that collects the most matching points.
The algorithm requires no training: image recognition is done using a purely mathematical approach.
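To make the pipeline concrete, here’s a rough brute-force sketch in Python with OpenCV. The dataset layout, the ratio test, and its 0.7 threshold are my own illustration rather than details from the project:

```python
import cv2

kaze = cv2.KAZE_create()
bf = cv2.BFMatcher(cv2.NORM_L2)

def recognize(query_img, dataset):
    # dataset: list of (painting_name, descriptors) pairs precomputed with KAZE
    _, query_desc = kaze.detectAndCompute(query_img, None)
    best_name, best_score = None, 0
    for name, desc in dataset:
        # For each query descriptor, take its two nearest neighbors and keep
        # the match only if the best one is clearly better than the runner-up
        # (Lowe's ratio test; the 0.7 threshold is illustrative).
        matches = bf.knnMatch(query_desc, desc, k=2)
        good = [m for m, n in matches if m.distance < 0.7 * n.distance]
        if len(good) > best_score:
            best_name, best_score = name, len(good)
    return best_name, best_score
```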
It was time to test the idea in practice, so we created a Telegram bot: you send it an image, and it replies with the recognition results.
These tests showed that the approach not only works but is the best fit given the restrictions of the project. The initial recognition accuracy was around 60%, which clearly needed improvement, as did the recognition speed on mobile devices.
One approach to increasing recognition quality is to collect key points from multiple images of the same object taken from different perspectives. This gives us more information about the object and thereby increases recognition accuracy.
However, this approach has a catch: due to the large number of duplicate key points, point scoring started to work incorrectly. The easiest way to explain what went wrong is with a simple example.
Let’s say we have a dataset containing photos of two people, Garry and Mary. We have two photos of Garry and ten photos of Mary. The algorithm determines all of Garry’s key points and puts them in one group, all of Mary’s key points — into another.
Now, let’s show a photo of Garry to the algorithm. Garry and Mary are both humans with a lot of similar features (eyes, nose, mouth, ears, etc.), so many of their key points will be similar as well. Suppose the algorithm finds seven similar key points in the “Mary” group and only two in the “Garry” group: it will then conclude, incorrectly, that the photo depicts Mary.
In other words, a group can win based solely on having more photos in the dataset.
We solved this issue by replacing each group of similar key points with a centroid, the average of their feature vectors. Doing this increased recognition quality from 60% to 90%. Further gains can be achieved by defining requirements for the images in a dataset.
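A minimal sketch of the centroid idea, assuming “similar” is defined by a Euclidean distance threshold (the greedy grouping and the threshold value are my own illustration, not the project’s exact method):

```python
import numpy as np

def merge_into_centroids(descriptors, threshold=0.5):
    """Replace each group of near-duplicate descriptors with its centroid,
    the mean feature vector of the group."""
    centroids = []
    used = np.zeros(len(descriptors), dtype=bool)
    for i in range(len(descriptors)):
        if used[i]:
            continue
        # Group all not-yet-used descriptors close to descriptor i.
        dists = np.linalg.norm(descriptors - descriptors[i], axis=1)
        group = (dists < threshold) & ~used
        used |= group
        centroids.append(descriptors[group].mean(axis=0))
    return np.vstack(centroids)
```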
Object recognition based on key points comes down to assessing the similarity between them, which means calculating the distance between the key points’ descriptors.
A naive brute-force search turned out to be too slow: for each point on the test image, it had to calculate the distance to every point in the dataset, and given the sheer number of points, that took far too long.
To speed things up, we replaced it with HNSW (Hierarchical Navigable Small World), an algorithm for approximate nearest-neighbor search that builds a hierarchical graph over the space. [3] Before the implementation of HNSW, recognition took multiple seconds; after it, the app ran at 1 to 3 fps.
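The article doesn’t name a specific library, but hnswlib is a common HNSW implementation; here’s how building and querying such an index might look (the data is random stand-in data, and all parameter values are illustrative):

```python
import hnswlib
import numpy as np

dim = 64  # KAZE descriptor length (default, non-extended)

# Stand-ins for the real data: stored key point vectors and the integer
# id of the painting each vector belongs to.
dataset_descriptors = np.random.rand(10000, dim).astype(np.float32)
dataset_classes = np.random.randint(0, 50, size=10000)
query_descriptors = np.random.rand(300, dim).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=len(dataset_descriptors), ef_construction=200, M=16)
index.add_items(dataset_descriptors, np.arange(len(dataset_descriptors)))
index.set_ef(50)  # query-time speed/accuracy trade-off

# For every key point of the query image, find its nearest stored vector
# approximately, instead of scanning the entire dataset.
nn_ids, distances = index.knn_query(query_descriptors, k=1)
matched_classes = dataset_classes[nn_ids.ravel()]
```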
The key point approach works perfectly within the constraints of this project. However, it’s not always an ideal choice and has a set of limitations.
First of all, as I mentioned earlier, not all objects have key points. Solid-color or very low-contrast objects can’t be detected or recognized using a key point approach.
Secondly, it’s not easy to give an accurate and definitive assessment of recognition confidence.
The confidence score is calculated by counting the matching key points for each image class. Every class gets its own number of points; for example, class 1 has 3 points, class 2 has 4 points, and so on. The class with the most points determines the recognition result.
The trouble starts when there are a lot of classes in the dataset: the total number of points across all classes goes into the denominator, and the winner’s points go into the numerator. Say the points of all classes sum to 100, the winner has 10 points, and the rest of the classes have 1 point each. The winner is unambiguous, yet dividing 10 by 100 yields a very low confidence score.
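A small sketch of this naive scoring, just to make the failure mode visible (the numbers mirror the example above):

```python
from collections import Counter

def score(matched_classes):
    # matched_classes: the class id of the nearest stored key point,
    # one entry per key point of the query image
    votes = Counter(matched_classes)
    winner, points = votes.most_common(1)[0]
    confidence = points / sum(votes.values())
    return winner, confidence

# One clear winner with 10 matched points, 90 other classes with 1 each:
matches = [0] * 10 + list(range(1, 91))
print(score(matches))  # (0, 0.1): a confident match scored at only 10%
```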
Thirdly, it’s difficult to reject an input image that isn’t present in the dataset. The algorithm will always find the closest image in the dataset, even if they share just one similar key point.