How to Build an Image Search Engine to Find Similar Images

Written by eora | Published 2022/02/06
Tech Story Tags: computer-vision | machine-learning | data-science | image-search | image-processing | machine-learning-tutorials | good-company | hackernoon-top-story


Hello everyone! My name is Vlad Vinogradov, I am the head of the computer vision department at EORA.ai. We have been engaged in deep learning for over three years and during that time we have implemented many projects for both Russian and international clients, which included the research and training of models.

Recently, we have focused on solving the problems of finding similar images and at the moment have created systems for finding logos, drawings, furniture, clothing, and other goods.

This post is for Machine Learning Engineers and is based on my talk Finding Similar Images - A Guide from A to Z, which was published by the Open Data Science community at Data Fest Online 2020.

This article provides background on the recommended methods used in the Image Retrieval task. After reading it, you will be able to build a search engine for similar images for your own objective from scratch (not counting the process of developing a production solution).

About the project

Similar Image Retrieval (also known as Content-Based Image Retrieval, or CBIR) is any search in which images are retrieved based on their visual content.
The picture above, from the article Recent Advance in Content-based Image Retrieval: A Literature Survey (2017), illustrates the problem most clearly.

Nowadays, the "Search by photo" approach is actively used, in particular in e-commerce services (AliExpress, Wildberries, etc.). "Keyword search" (with an understanding of image content) has long been standard in search engines such as Google and Yandex, but has not yet reached marketplaces and other private search engines. I think that since the appearance of the sensational CLIP: Connecting Text and Images in computer vision circles, the spread of this approach will accelerate.

Since our team specializes in neural networks in computer vision, in this article I will focus only on the Search by Photo approach.

Basic service components

Step 1. Train the model. The model can be built on classic CV methods or on a neural network. Model input: an image; output: a D-dimensional descriptor/embedding. In the classic case, it can be a combination of a SIFT descriptor and a Bag of Visual Words. In the neural network case, it is a standard backbone like ResNet or EfficientNet, plus intricate pooling layers and clever training techniques, which we will discuss later. I can say that given enough data or good practices, neural networks almost always perform significantly better (we have verified this), so we will focus on them.

Step 2. Index the image database. Indexing means running the trained model over all images and writing the embeddings to a special index for fast search.

Step 3. Search. The model is run on the image uploaded by the user, its embedding is obtained, and this embedding is compared with the rest in the database. The output is the search results sorted by relevance.
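The three steps above can be sketched end to end. This is a minimal numpy illustration, assuming a trained model already produced the embeddings: the "index" is just a matrix of normalized vectors, and search is a dot product plus sorting by similarity.

```python
import numpy as np

def l2_normalize(x):
    # Normalize rows so cosine similarity reduces to a dot product.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_index(embeddings):
    # Step 2: in the simplest case, "indexing" is just storing a
    # normalized matrix of all database embeddings.
    return l2_normalize(np.asarray(embeddings, dtype=np.float32))

def search(index, query_embedding, k=5):
    # Step 3: compare the query embedding with every indexed one and
    # return the k most similar items, sorted by decreasing similarity.
    q = l2_normalize(np.asarray(query_embedding, dtype=np.float32))
    sims = index @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Toy example: 4 "image" embeddings in a 3-D space.
index = build_index([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1]])
ids, scores = search(index, [1, 0, 0], k=2)
print(ids)  # item 0 is an exact match, item 1 is the next closest
```

In production, the brute-force `search` above is replaced by an approximate index (discussed in the Ranking section), but the interface stays the same.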

Neural Networks and Metric Learning

The neural network in the task of finding similarities is used as a feature extractor (backbone). The choice of backbone depends on the volume and complexity of the data - you can consider everything from ResNet18 to Visual Transformer.


The first distinguishing feature of Image Retrieval models is the magic in the head of the neural network. On the Image Retrieval leaderboard, people compete to build the best descriptors: there are Combined Global Descriptors with parallel poolings, and Batch Drop Block for a more even distribution of activations over the output feature map.

The second main feature is the loss functions. There are a lot of them: Deep Image Retrieval: A Survey alone lists over a dozen recommended pairwise losses, and about as many classification ones. The main point of all these losses is to train the neural network to map an image into a vector of a linearly separable space, so that these vectors can then be compared by cosine or Euclidean distance: similar images will have close embeddings, dissimilar images distant ones. Let's take a closer look.

Loss Functions

Contrastive Loss

This is a pairwise loss, i.e. objects are compared by the distance between them.

The neural network is penalized for the distance between the embeddings of images p and q if these images are actually similar. Symmetrically, it is penalized for the proximity of embeddings whose images are actually different. In the latter case, we set a margin m (for example, 0.5); once the distance exceeds it, we consider that the network has succeeded in "separating" the dissimilar pair.
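The rule just described can be written as a short numpy sketch of the contrastive loss (the margin value m=0.5 follows the example in the text; in a real pipeline this would operate on batches of framework tensors with gradients):

```python
import numpy as np

def contrastive_loss(p, q, similar, m=0.5):
    # Penalize distance between embeddings of similar images, and
    # proximity of dissimilar ones up to the margin m.
    d = np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))
    if similar:
        return d ** 2
    return max(0.0, m - d) ** 2

# Similar pair that is far apart -> penalized.
print(contrastive_loss([0.0, 0.0], [1.0, 0.0], similar=True))   # 1.0
# Dissimilar pair already farther apart than the margin -> zero loss.
print(contrastive_loss([0.0, 0.0], [1.0, 0.0], similar=False))  # 0.0
```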

Triplet Loss

Here we aim to minimize the distance from the anchor to the positive and maximize the distance from the anchor to the negative. Triplet Loss was first introduced in Google's FaceNet article on facial recognition and has long been a state-of-the-art solution.
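In formula form, the triplet loss is max(0, d(a, p) - d(a, n) + m). A minimal numpy sketch (the margin m=0.2 is an illustrative choice, not a value from the source):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, m=0.2):
    # Pull the positive towards the anchor and push the negative away,
    # until the negative is at least m farther than the positive.
    a, p, n = (np.asarray(x, dtype=float) for x in (anchor, positive, negative))
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    return max(0.0, d_ap - d_an + m)

# The negative is already far enough: the triplet is "solved", loss is 0.
print(triplet_loss([0, 0], [0.1, 0], [2, 0]))  # 0.0
```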

N-tupled Loss

N-tupled Loss is a development of Triplet Loss: it also takes an anchor and a positive, but uses multiple negatives instead of a single one.

Angular Additive Margin (ArcFace)

The problem with pairwise losses lies in choosing the combinations of positives, negatives, and anchors: if they are sampled uniformly at random from the dataset, the problem of "easy pairs" arises. These are pairs of images so simple that the loss on them is 0. It turns out that the network quickly converges to a state in which most elements in the batch are "easy" for it and the loss on them is zero - the network stops learning. To avoid this problem, sophisticated pair-mining techniques were invented: hard negative and hard positive mining. You can read more about the problem in this article. There is also the PML library (PyTorch Metric Learning), which implements many mining methods and, in general, contains a lot of useful tools for the Metric Learning task in PyTorch.

Another solution to the problem is classification losses. Consider one popular loss function that brought state-of-the-art facial recognition three years ago - ArcFace.

The main idea is to add an angular margin m to the usual cross-entropy, which packs the embeddings of images of one class around that class's centroid so that they are all separated from the embedding clusters of other classes by at least the angle m.

This seems to be the perfect loss function, especially when you look at the MegaFace benchmark. But keep in mind that it only works when class labels are available. If you don't have them, you will have to work with pairwise losses.

Here I visually show which loss functions are best used when you have single-class and multi-class labels (pairwise labels can be derived from the latter by counting the percentage of intersection between the multi-label vectors of examples).

Pooling

Let's go back to the neural network architecture and consider a couple of pooling layers used in Image Retrieval tasks.

R-MAC

Regional Maximum Activation of Convolutions (R-MAC) is a pooling layer that takes the output feature map of the neural network (before the global pooling or classification layers) and returns a descriptor vector computed as the sum of activations over different windows of the output map. Here the activation of a window is the maximum over that window, taken for each channel independently.

The resulting descriptor takes into account the local features of the image at different scales, thereby creating a rich feature description. This descriptor itself can be an embedding, so it can be sent directly to the loss function.
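A simplified numpy sketch of the idea: per-window channel-wise max pooling, per-region normalization, and summation across regions at several scales. Note this is an approximation for illustration: it uses a non-overlapping grid, whereas the original R-MAC samples overlapping windows.

```python
import numpy as np

def rmac(feature_map, scales=(1, 2)):
    # feature_map: (C, H, W) backbone output before global pooling.
    C, H, W = feature_map.shape
    descriptor = np.zeros(C)
    for n in scales:
        # Split the map into an n x n grid of roughly equal regions.
        hs = np.array_split(np.arange(H), n)
        ws = np.array_split(np.arange(W), n)
        for h_idx in hs:
            for w_idx in ws:
                region = feature_map[:, h_idx][:, :, w_idx]
                v = region.max(axis=(1, 2))          # per-channel max
                v = v / (np.linalg.norm(v) + 1e-12)  # normalize per region
                descriptor += v                      # sum across regions
    return descriptor / np.linalg.norm(descriptor)

fmap = np.random.rand(8, 6, 6)
desc = rmac(fmap)
print(desc.shape)  # (8,): one value per channel
```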

Generalized Mean

Generalized Mean (GeM) is a simple pooling that can improve the quality of the output descriptor. The idea is that classical average pooling can be generalized to a power mean with a learnable exponent. By increasing the exponent, we make the network focus on the significant parts of the image, which can be important in certain tasks.
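GeM is one line of math: ((1/N) * sum(x^p))^(1/p) per channel. With p = 1 it is plain average pooling; as p grows it approaches max pooling, emphasizing the strongest activations. A minimal numpy sketch:

```python
import numpy as np

def gem(feature_map, p=3.0, eps=1e-6):
    # feature_map: (C, H, W). p = 1 gives average pooling;
    # p -> infinity approaches max pooling.
    x = np.clip(feature_map, eps, None)
    return (x ** p).mean(axis=(1, 2)) ** (1.0 / p)

fmap = np.zeros((1, 2, 2))
fmap[0] = [[0.1, 0.1], [0.1, 1.0]]
print(gem(fmap, p=1.0))   # plain average: 0.325
print(gem(fmap, p=10.0))  # much closer to the max activation (1.0)
```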

Ranking

Index

The key to a high-quality similar-image search is ranking, i.e. displaying the most relevant examples for a given query. The index is characterized by the speed of building the descriptor index, the speed of the search, and the memory consumed.

The simplest approach is to store the embeddings as-is and do a brute-force search over them, for example, using the cosine distance. Problems appear when there are many embeddings - millions, tens of millions, or more. The search speed drops significantly and the memory consumed grows. One positive thing remains: the search quality is perfect for the existing embeddings.

These problems can be solved at the expense of quality: store the embeddings not in their original form but compressed (quantized), and change the search strategy - instead of brute force, try to find the required number of nearest neighbors to a given query with the minimum number of comparisons. There are many efficient frameworks for approximate nearest neighbor search. A dedicated benchmark has been created for them, where you can see how each library behaves on different datasets.
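As a toy illustration of the compression idea, here is simple per-vector scalar quantization to int8 in numpy: 4x less memory at the cost of some precision. This is only a sketch; libraries like Faiss use far more sophisticated schemes (e.g. product quantization) together with smarter search structures.

```python
import numpy as np

def quantize(embeddings):
    # Map float32 values to int8, keeping one scale factor per vector.
    scale = np.abs(embeddings).max(axis=1, keepdims=True) / 127.0 + 1e-12
    codes = np.round(embeddings / scale).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

emb = np.random.randn(1000, 128).astype(np.float32)
codes, scale = quantize(emb)
restored = dequantize(codes, scale)
print(codes.nbytes / emb.nbytes)  # 0.25: a quarter of the original memory
```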

Most popular: NMSLIB, Spotify Annoy, Facebook Faiss, Google ScaNN. Also, if you want indexing with a REST API out of the box, you can consider Jina.

Re-ranking

Researchers in the Information Retrieval field have long understood that ordered search results can be improved by reordering the items after the original results have been retrieved.

One such method is Query Expansion. The idea is to use the top-k closest elements to generate a new embedding. In the simplest case, you can take the averaged vector, as shown in the picture above. You can also weight the embeddings, for example, by their position in the ranking or by their cosine distance to the query. Such improvements are described within a single framework in the article Attention-Based Query Expansion Learning. Optionally, Query Expansion can be applied recursively.
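The simplest variant, Average Query Expansion, can be sketched in a few lines of numpy, assuming the index rows are already L2-normalized:

```python
import numpy as np

def average_query_expansion(query, index, k=2):
    # Average the query with its top-k nearest neighbors to obtain a new,
    # "expanded" query embedding (rows of `index` are assumed normalized).
    sims = index @ query
    top = np.argsort(-sims)[:k]
    expanded = np.vstack([query, index[top]]).mean(axis=0)
    return expanded / np.linalg.norm(expanded)

index = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
query = np.array([1.0, 0.0])
expanded = average_query_expansion(query, index, k=2)
# The expanded query is pulled towards the neighborhood of the original
# top results, which often retrieves additional relevant items.
```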

k-reciprocal

k-reciprocal - the subset of the top-k elements whose own k nearest neighbors include the query itself. On the basis of this set, re-ranking procedures are built, one of which is described in the article Re-ranking Person Re-identification with k-reciprocal Encoding. By definition, elements of the k-reciprocal set are closer to the query than ordinary k-nearest neighbors. Accordingly, they can be roughly treated as known positives, which changes the weighting rule, for example, for Query Expansion. The article also develops a mechanism for recalculating distances using the k-reciprocal sets of the top-k elements themselves. It contains a lot of calculations that are beyond the scope of this post, so I suggest interested readers look at it themselves.
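The definition of the k-reciprocal set itself is easy to sketch in numpy (a brute-force illustration over a full pairwise distance matrix, not the optimized procedure from the paper):

```python
import numpy as np

def k_reciprocal_set(query_idx, dist, k):
    # dist: full pairwise distance matrix, queries included.
    # An element belongs to the k-reciprocal set of the query if it is among
    # the query's k nearest neighbors AND the query is among its own.
    def knn(i):
        order = np.argsort(dist[i], kind="stable")
        return set(order[order != i][:k])
    return {j for j in knn(query_idx) if query_idx in knn(j)}

# 1-D "embeddings": two tight clusters.
points = np.array([0.0, 0.2, 0.4, 5.0, 5.2])
dist = np.abs(points[:, None] - points[None, :])
print(k_reciprocal_set(0, dist, k=2))  # {1, 2}: mutual nearest neighbors
print(k_reciprocal_set(3, dist, k=2))  # {4}: point 2 does not reciprocate
```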

Validation

We come to the part of checking the quality of the search for similar ones. There are many subtleties in this task that may not be noticed by beginners when they first start working on an Image Retrieval project.

Metrics

Let's consider the popular metrics in the Image Retrieval task: precision@k, recall@k, R-precision, mAP, and nDCG.

precision@k

Pros

  • Shows the percentage of top-k responses that are relevant.

Cons

  • very sensitive to the number of relevant examples for a given query, which makes it impossible to objectively assess quality when different queries have different numbers of relevant examples
  • reaching the value 1 is possible only if the number of relevant items >= k for every query

R-precision

Same as precision@k, with k set equal to the number of items relevant to the query.

Pros

  • the sensitivity to the number k in precision@k disappears; the metric becomes stable

Cons

  • you have to know the total number of items relevant to the query (a problem if not all relevant items are labeled)

Recall@k

Shows what proportion of relevant items was found in the top-k.

Pros

  • answers whether any relevant items were found at all in the top-k
  • stable and averages well over queries

mAP (mean Average Precision)

Shows how densely we fill the top of the search results with relevant examples. You can think of it as the amount of information a search engine user receives while reading the fewest possible pages: the greater the amount of information per page read, the higher the metric.

Pros

  • an objective, stable assessment of search quality
  • a single-number representation of the precision-recall curve, which is itself rich in information for analysis

Cons

  • you have to know the total number of items relevant to the query (a problem if not all relevant items are labeled)

    You can read more about metrics in Information Retrieval, including the derivation of mAP, here

nDCG (Normalized Discounted Cumulative Gain)

This metric shows how correctly the elements in the top-k are ordered among themselves. We will not weigh the pros and cons of this metric, since it is the only one on our list that takes the order of elements into account. Moreover, studies show that it is fairly stable when order matters and is suitable in most such cases.
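The set-based metrics above are a few lines each. A minimal numpy-free sketch of precision@k, recall@k, and Average Precision (mAP is AP averaged over queries), using the standard definitions:

```python
def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k results that are relevant.
    return len(set(ranked[:k]) & relevant) / k

def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant items that appear in the top-k.
    return len(set(ranked[:k]) & relevant) / len(relevant)

def average_precision(ranked, relevant):
    # Mean of precision@k over positions k where a relevant item appears:
    # high when relevant items sit densely at the top of the results.
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

ranked = [3, 1, 4, 2]   # search results, most relevant first
relevant = {1, 2}       # ground-truth relevant items
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(recall_at_k(ranked, relevant, 2))     # 0.5
print(average_precision(ranked, relevant))  # (1/2 + 2/4) / 2 = 0.5
```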

Estimation

macro: the metric is computed for each query, then averaged over all queries

pros: no significant fluctuations when different queries have different numbers of relevant items

cons: all queries are treated as equal, even if some have many more relevant items than others

micro: the numbers of labeled relevant items and of successfully found relevant items are each summed over all queries, then used in the fraction of the corresponding metric

pros: queries are evaluated taking into account the number of labeled relevant items for each of them

cons: the metric can become very low or very high if some query had many labeled relevant items and the system brought them to the top unsuccessfully or successfully

Validation system

I propose considering two validation setups

Validation on a set of queries and selected relevant ones

Input: query images and the images relevant to them. The labeling is a list of relevant items for each query.

To compute metrics, you can build the all-pairs relevance matrix and, based on the binary information about element relevance, calculate the metrics.

Full-base validation

Input: query images and the images relevant to them. There should also be a validation database of images in which, ideally, all items relevant to the queries are labeled. It should not contain the query images themselves; otherwise they will have to be filtered out at search time so that they do not clog the top-1. The validation database serves as the pool of negatives - our task is to pull out the relevant items from it.

To compute metrics, you can iterate over all queries, calculate the distances to all elements, including the relevant ones, and pass them to the metric-calculation function.

Completed project example

Some companies dispute other companies' use of the pictorial elements of their brand. In such cases, a weaker manufacturer tries to capitalize on the success of a large brand by presenting its products and services under similar symbols. Users suffer too: you may mistakenly buy cheese from the wrong manufacturer, trusting a familiar look but taking a counterfeit product without carefully reading the label. A recent example: Apple Is Attempting to Block a Pear Logo Trademark.

There are special organizations, governmental and private, that fight such illegal trademark use. They maintain a register of registered trademarks against which they can compare newly submitted marks and approve or reject applications for trademark registration. Above is an example of the interface of the WIPO system. In such systems, similar-image search is a good helper: experts can find analogs faster.

Examples of how our system works

The size of the indexed image database: several million trademarks. The first image is the query, the next line shows the list of expected relevant marks, and the remaining rows are what the search engine returns in order of decreasing relevance.

Conclusion

That's all - this was an overview. I hope those who are building or about to build a similar-image search engine found it useful. And if you think I am wrong somewhere, tell me in the comments; I will be glad to receive feedback.

