Hello everyone! My name is Vlad Vinogradov, I am the head of the computer vision department at EORA.ai. We have been engaged in deep learning for over three years and during that time we have implemented many projects for both Russian and international clients, which included the research and training of models.
Recently, we have focused on solving the problems of finding similar images and at the moment have created systems for finding logos, drawings, furniture, clothing, and other goods.
This post is for Machine Learning Engineers and is based on my talk Finding Similar Images - A Guide from A to Z, which was published by the Open Data Science community at Data Fest Online 2020.
This article provides background on the recommended methods for the Image Retrieval task. After reading it, you will be able to build a similar-image search engine for your own problem from scratch (not counting the process of developing a production solution).
Similar Image Retrieval (also known as Content-Based Image Retrieval, or CBIR) is any search in which the query is an image and the results are ranked by their similarity to it.
The picture above is from the article
Nowadays, the "Search by photo" approach is actively used, in particular in e-commerce services (AliExpress, Wildberries, etc.). "Keyword search" (with an understanding of image content) has long been standard in search engines such as Google and Yandex, but has not yet reached marketplaces and other private search engines. I think that since the appearance of the sensational
Since our team specializes in neural networks in computer vision, in this article I will focus only on the Search by Photo approach.
Step 1. Train the model. The model can be built on classic CV or on a neural network. Model input - an image, output - a D-dimensional descriptor/embedding. In the classic case, it can be a combination of a SIFT descriptor + Bag of Visual Words. In the neural network case - a standard backbone like ResNet, EfficientNet, etc. + intricate pooling layers + clever training techniques, which we will talk about later. I can say that if you have enough data, neural networks will almost always perform significantly better (we checked), so we will focus on them.
Step 2. Index the image database. Indexing means running the trained model on all images and writing the embeddings to a special index for quick search.
Step 3. Search. The model is run on the image uploaded by the user, its embedding is obtained, and this embedding is compared with the rest in the database. The search output is the results sorted by relevance.
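To make the three steps concrete, here is a minimal NumPy sketch of the indexing and search stages. The `embed` function is a hypothetical stand-in for a trained model, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained model: in practice this would be
# a CNN backbone (ResNet, EfficientNet, ...) mapping an image to a
# D-dimensional embedding.
def embed(image: np.ndarray) -> np.ndarray:
    vec = image.reshape(-1).astype(np.float32)[:8]  # toy "descriptor"
    return vec / (np.linalg.norm(vec) + 1e-12)      # L2-normalize

# Step 2: index the database - run the model on every image and stack
# the embeddings into one matrix.
base_images = [rng.normal(size=(4, 4)) for _ in range(100)]
index = np.stack([embed(img) for img in base_images])  # shape (N, D)

# Step 3: search - embed the query and rank the database by cosine
# similarity (a dot product, since the vectors are unit-norm).
def search(query_image: np.ndarray, k: int = 5):
    q = embed(query_image)
    scores = index @ q
    top_k = np.argsort(-scores)[:k]
    return top_k, scores[top_k]

ids, scores = search(base_images[0])
```

Searching with a database image as the query returns that image itself in first place with similarity close to 1.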
The neural network in the task of finding similarities is used as a feature extractor (backbone). The choice of backbone depends on the volume and complexity of the data - you can consider everything from
The first feature of models in Image Retrieval is the magic in the head of the neural network. On the Image Retrieval leaderboard, people compete to build the best descriptors - there are Combined Global Descriptors with parallel pooling branches and Batch Drop Block for a more even distribution of activations over the output feature map.
The second main feature is the loss functions. There are a lot of them. In
This is a pairwise loss, i.e. objects are compared by the distance between them.
The neural network is penalized for the distance between the embeddings of images p and q if these images are actually similar. Similarly, there is a penalty for the proximity of embeddings whose images are actually different from each other. In the latter case, we set a margin m (for example, 0.5); once it is exceeded, we consider that the neural network has coped with the task of "separating" the dissimilar images.
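The penalty described above can be written down directly. Below is a minimal NumPy sketch of Contrastive Loss for a single pair (a training implementation would operate on batches of PyTorch tensors to get gradients):

```python
import numpy as np

def contrastive_loss(x: np.ndarray, y: np.ndarray,
                     similar: bool, m: float = 0.5) -> float:
    """Contrastive loss for one pair of embeddings.
    Similar pairs are penalized for any distance between them;
    dissimilar pairs are penalized only while closer than the margin m."""
    d = np.linalg.norm(x - y)
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, m - d) ** 2
```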
Here we aim to minimize the distance from the anchor to the positive and maximize the distance from the anchor to the negative. Triplet Loss was first introduced in Google's
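The anchor/positive/negative relation can be sketched in a few lines of NumPy (again, a real implementation would operate on batches of tensors):

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, m: float = 0.2) -> float:
    """Triplet loss: the anchor must be closer to the positive than to
    the negative by at least the margin m, otherwise a penalty is paid."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + m)
```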
N-tuple Loss is a development of Triplet Loss that also takes an anchor and a positive, but uses multiple negatives instead of one.
The problem with pairwise losses is choosing the combinations of positives, negatives, and anchors - if they are just taken uniformly at random from the dataset, the problem of "easy pairs" arises. These are pairs of images for which the loss is 0. It turns out that the network converges quickly enough to a state in which most elements in the batch are "easy" for it, and the loss for them is zero - the network stops learning. To avoid this problem, sophisticated pair mining techniques were invented - hard negative and hard positive mining. You can read more about the problem in this article. There is also the PML library, which implements many mining methods; in general, the library contains a lot of useful tooling for the Metric Learning task in PyTorch.
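One common mining strategy is "batch hard" mining: within each batch, pick for every anchor the farthest positive and the closest negative. A minimal NumPy sketch of the selection step (illustrative, not the PML implementation):

```python
import numpy as np

def batch_hard_triplets(embeddings: np.ndarray, labels: np.ndarray):
    """For each anchor, pick the hardest positive (farthest same-label
    embedding) and the hardest negative (closest other-label embedding)."""
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    triplets = []
    for i in range(len(labels)):
        pos_mask = same[i].copy()
        pos_mask[i] = False          # an anchor is not its own positive
        neg_mask = ~same[i]
        if not pos_mask.any() or not neg_mask.any():
            continue                 # no valid triplet for this anchor
        hardest_pos = np.where(pos_mask, dists[i], -np.inf).argmax()
        hardest_neg = np.where(neg_mask, dists[i], np.inf).argmin()
        triplets.append((i, hardest_pos, hardest_neg))
    return triplets
```

These triplets then go into the triplet loss instead of uniformly sampled ones, keeping the loss non-zero for longer.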
Another solution to the problem is classification losses. Consider one popular loss function that led to state of the art in facial recognition three years ago - ArcFace.
The main idea is to add a margin m to the usual cross-entropy, which distributes the embeddings of images of one class around the centroid of that class so that they are all separated from the embedding clusters of other classes by at least an angle m.
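A sketch of the forward pass with the angular margin. This is simplified (no easy-margin handling or other numerical tricks from the paper), and the names are illustrative:

```python
import numpy as np

def arcface_logits(embedding: np.ndarray, class_centers: np.ndarray,
                   label: int, m: float = 0.5, s: float = 30.0) -> np.ndarray:
    """ArcFace-style logits: add an angular margin m to the angle between
    the embedding and its own class center, then scale by s. The logits
    are then fed to the usual softmax cross-entropy."""
    e = embedding / np.linalg.norm(embedding)
    w = class_centers / np.linalg.norm(class_centers, axis=1, keepdims=True)
    cos = np.clip(w @ e, -1.0, 1.0)   # cosine to every class center
    theta = np.arccos(cos)
    theta[label] += m                 # widen the angle for the true class only
    return s * np.cos(theta)
```

Because the true-class logit is reduced by the margin, the network has to pull embeddings at least m radians closer to their class centroid to achieve the same cross-entropy as without the margin.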
This seems to be the perfect loss function, especially when you look at the MegaFace benchmark. But you need to keep in mind that it only works if you have classification labels. If you don't have them, you will have to work with pairwise losses.
Here I visually show which loss functions are best used when you have single-label and multi-label annotation (you can derive pairwise labels from the latter by counting the percentage of intersection between the multilabel vectors of examples).
Let's go back to the neural network architecture and consider a couple of pooling layers used in Image Retrieval tasks.
Regional Maximum Activation of Convolutions (R-MAC) is a pooling layer that accepts the output feature map of the neural network (before the global pooling or classification layers) and returns a descriptor vector computed as the sum of activations over different windows of that map. Here, the activation of a window is the maximum over the window, taken for each channel independently.
The resulting descriptor takes into account the local features of the image at different scales, thereby creating a rich feature description. This descriptor itself can be an embedding, so it can be sent directly to the loss function.
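A simplified NumPy sketch of R-MAC over a C×H×W feature map. The original uses a particular multi-scale sliding-window grid; here the windows are non-overlapping for brevity:

```python
import numpy as np

def rmac(feature_map: np.ndarray, window_sizes=(2, 4)) -> np.ndarray:
    """R-MAC sketch: max-pool activations inside square windows at several
    scales, L2-normalize each regional descriptor, and sum them."""
    C, H, W = feature_map.shape
    desc = np.zeros(C)
    for w in window_sizes:
        for y in range(0, H - w + 1, w):
            for x in range(0, W - w + 1, w):
                # per-channel maximum inside this window
                region = feature_map[:, y:y + w, x:x + w].max(axis=(1, 2))
                desc += region / (np.linalg.norm(region) + 1e-12)
    return desc / np.linalg.norm(desc)   # final L2-normalized descriptor
```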
Generalized Mean (GeM) is a simple pooling that can improve the quality of the output descriptor. The idea is that classical average pooling can be generalized to a power mean with parameter lambda. By increasing lambda, we make the network focus on the more significant parts of the image, which can be important in certain tasks.
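GeM is nearly a one-liner. A NumPy sketch over a C×H×W feature map, with the lambda above written as the exponent p (typically around 3, and learnable in a network):

```python
import numpy as np

def gem_pool(feature_map: np.ndarray, p: float = 3.0,
             eps: float = 1e-6) -> np.ndarray:
    """Generalized Mean pooling over the spatial dimensions.
    p = 1 recovers average pooling; p -> infinity approaches max pooling."""
    x = np.clip(feature_map, eps, None)  # activations assumed non-negative
    return (x ** p).mean(axis=(1, 2)) ** (1.0 / p)
```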
The key to a high-quality search for similar images is ranking, i.e. displaying the most relevant examples for a given query. This process is characterized by the speed of building the descriptor index, the speed of the search, and the memory consumed.
The simplest approach is to store the embeddings as-is and do a brute-force search over them, for example, using the cosine distance. Problems appear when there are many embeddings - millions, tens of millions, or even more. The search speed drops significantly and the memory consumption grows. One positive thing remains - the search quality is exact for the stored embeddings.
These problems can be solved at the expense of quality - by storing embeddings not in their original form but compressed (quantized), and by changing the search strategy - instead of brute force, trying to find the required number of nearest neighbours to the query with the minimum number of comparisons. There are many efficient frameworks for approximate nearest neighbour search. A special
Most Popular:
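In practice you would use one of these libraries (e.g. Faiss or Annoy), but the core idea of trading accuracy for speed can be shown with a toy inverted-file (IVF-style) index in NumPy: assign every vector to its nearest coarse centroid and, at query time, scan only the few closest cells. All names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 16)).astype(np.float32)
base /= np.linalg.norm(base, axis=1, keepdims=True)   # unit-norm embeddings

# Coarse quantizer: a handful of centroids (a real IVF index trains them
# with k-means; here they are just sampled vectors for brevity).
centroids = base[rng.choice(len(base), 8, replace=False)]
assignment = np.argmax(base @ centroids.T, axis=1)    # cell of each vector

def ann_search(q: np.ndarray, nprobe: int = 2, k: int = 5) -> np.ndarray:
    probe = np.argsort(-(centroids @ q))[:nprobe]         # closest cells
    candidates = np.where(np.isin(assignment, probe))[0]  # vectors in them
    scores = base[candidates] @ q
    return candidates[np.argsort(-scores)[:k]]

q = base[42]
approx = ann_search(q)               # scans only a fraction of the base
exact = np.argsort(-(base @ q))[:5]  # brute-force baseline
```

The approximate search touches only ~nprobe/8 of the database; the `nprobe` parameter is exactly the quality/speed knob that real ANN libraries expose.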
Researchers in the Information Retrieval field have long understood that ordered search results can be improved by reordering the items after the original results have been retrieved.
One such method is Query Expansion. The idea is to use the top-k closest elements to generate a new embedding. In the simplest case, you can take the averaged vector, as shown in the picture above. You can also weight the embeddings, for example, by their rank in the results or by their cosine similarity to the query. Such improvements are described in a single framework in the article
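A minimal sketch of average query expansion over an index of unit-norm embeddings, weighting neighbours by cosine similarity (names illustrative):

```python
import numpy as np

def query_expansion(q: np.ndarray, index: np.ndarray,
                    k: int = 5, alpha: float = 1.0) -> np.ndarray:
    """Build a new query embedding from the original one plus its top-k
    neighbours, each weighted by (cosine similarity) ** alpha."""
    scores = index @ q                    # cosine similarities (unit vectors)
    top = np.argsort(-scores)[:k]
    weights = scores[top] ** alpha
    new_q = q + (index[top] * weights[:, None]).sum(axis=0)
    return new_q / np.linalg.norm(new_q)  # re-normalize, then search again
```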
k-reciprocal - the subset of the top-k elements whose own k nearest neighbours include the query itself. On the basis of this set, a re-ranking process is built, one variant of which is described in the article
We come to the part about evaluating the quality of similarity search. There are many subtleties in this task that beginners may not notice when they first start working on an Image Retrieval project.
Let's consider the following popular metrics in the Image Retrieval task: precision@k, recall@k, R-precision, mAP, and nDCG.
Pros
Cons
The same as precision@k, with k set equal to the number of items relevant to the query.
Pros
Cons
Shows what proportion of the relevant items was found in the top-k.
Pros
Shows how densely we fill the top of the search results with relevant examples. You can think of it as the amount of information a search engine user receives while reading the fewest pages. Accordingly, the greater the ratio of information to pages read, the higher the metric.
Pros
Cons
you have to know the total number of items relevant to the query (this can be a problem if not all relevant items are annotated)
You can read more about metrics in Information Retrieval, including the mAP output,
This metric shows how correctly the elements in the top-k are ordered among themselves. We will not consider the pros and cons of this metric, since it is the only one in our list that takes into account the order of elements. However, there are studies showing that this metric is fairly stable and suitable in most cases where order must be taken into account.
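The order-insensitive metrics above take only a few lines each. A sketch with a toy ranked list (the item ids are arbitrary):

```python
def precision_at_k(ranked, relevant, k):
    """Share of the top-k results that are relevant."""
    return len(set(ranked[:k]) & relevant) / k

def recall_at_k(ranked, relevant, k):
    """Share of all relevant items that made it into the top-k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of precision@i over the positions i of relevant items;
    averaging this over all queries gives mAP."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

ranked = [3, 1, 7, 5, 9]   # search output, most similar first
relevant = {1, 5}          # ground-truth relevant ids for this query
```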
macro: the metric is calculated for each query, then averaged over all queries
pros: no significant fluctuations caused by queries having different numbers of relevant items
cons: all queries are considered equal, even if some have many more relevant items than others
micro: the number of annotated relevant items and, separately, the number of successfully found relevant items are summed over all queries, and then used in the numerator and denominator of the corresponding metric
pros: queries are evaluated taking into account the number of annotated relevant items for each of them
cons: the metric can become very low / very high if some query had a large number of annotated relevant items and the system unsuccessfully / successfully brought them to the top
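A toy example of how the two averaging schemes diverge, using recall and hypothetical per-query counts:

```python
# (found in top-k, total annotated relevant) for two hypothetical queries
results = [(1, 2), (9, 10)]

# macro: average the per-query recalls - every query counts equally
macro = sum(found / total for found, total in results) / len(results)

# micro: pool the counts first - queries with many relevant items dominate
micro = sum(f for f, _ in results) / sum(t for _, t in results)
```

Here macro gives 0.7 while micro gives about 0.83: the second query, with ten annotated relevant items, pulls the micro value toward its own recall.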
I propose considering two validation schemes
Input: image queries and the images relevant to them. The annotation is a list of relevant images for each query.
To compute the metrics, you can calculate the pairwise relevance matrix and, based on the binary relevance of the elements, calculate the metrics.
Input: image queries and the images relevant to them. There should also be a validation database of images in which, ideally, all items relevant to the queries are annotated. It also should not contain the query images themselves; otherwise, they would have to be filtered out at the search stage so that they do not clog the top-1. The validation database serves as a base of negatives - our task is to draw out the relevant items from it.
To compute the metrics, you can go through all the queries, calculate the distances to all elements, including the relevant ones, and pass them to the metric calculation function.
Some companies dispute with others over the use of the pictorial elements of their brand. In such cases, a weaker manufacturer tries to parasitize on the success of a large brand by presenting its products and services under similar symbols. Users suffer from this too - you may mistakenly buy cheese from the wrong manufacturer: you already trust the brand, but you take a counterfeit product without carefully reading the label. An example from a recent one:
There are special organizations, governmental and private, that fight such illegal trademark use. They maintain a register of registered trademarks against which they compare newly submitted trademarks and approve / reject the registration application. Above is an example of the interface of the
The size of the indexed image database: several million trademarks. Here the first image is the query; the next line shows the list of expected relevant items; the remaining lines are what the search engine returns in order of decreasing relevance.
That's all. This was an overview. I hope those who are building or about to build similar image search engines found it useful. And conversely, if you think I am wrong somewhere, tell me in the comments - I will be glad to receive feedback.