Hello everyone! My name is Vlad Vinogradov, and I am the head of the computer vision department at EORA.ai. We have been doing deep learning for over three years, and in that time we have delivered many projects for both Russian and international clients that involved research and model training. Recently we have focused on problems of finding similar images, and so far we have built search systems for logos, drawings, furniture, clothing, and other goods.

This post is aimed at Machine Learning Engineers and is based on my talk "Finding Similar Images: A Guide from A to Z", presented at Data Fest Online 2020 by the Open Data Science community. The article provides background on the recommended methods for the Image Retrieval task. After reading it, you will be able to build a similar-image search engine for your own problem from scratch (not counting the development of a production solution).

About the project

Similar Image Retrieval (also known as Content-Based Image Retrieval, or CBIR) is any search that involves images. The picture above, from the survey "Recent Advance in Content-based Image Retrieval: A Literature Survey" (2017), illustrates the problem most clearly.

Nowadays "search by photo" is actively used, in particular in e-commerce services (AliExpress, Wildberries, etc.). "Keyword search" with an understanding of image content has long been part of search engines such as Google and Yandex, but it has not yet reached marketplaces and other private search engines. I think that after the appearance of CLIP: Connecting Text and Images, which caused a sensation in computer vision circles, the spread of this approach will accelerate. Since our team specializes in neural networks for computer vision, in this article I will focus only on the search-by-photo approach.

Basic service components

Step 1. Train the model. The model can be built with classic CV methods or on the basis of a neural network.
The model takes an image as input and outputs a D-dimensional descriptor (embedding). In the classic case, this can be a combination of SIFT descriptors and a Bag of Visual Words. In the neural network case, it is a standard backbone such as ResNet or EfficientNet plus intricate pooling layers and clever training techniques, which we will discuss later. I can say that given enough data or good training practices, neural networks will almost always win by a large margin (we checked), so we will focus on them.

Step 2. Index the image base. Indexing means running the trained model on all images and writing the embeddings to a special index for fast search.

Step 3. Search. The model is run on the image uploaded by the user, the embedding is obtained, and this embedding is compared with the rest in the database. The search result is the matches sorted by relevance.

Neural Networks and Metric Learning

In the similarity-search task, a neural network is used as a feature extractor (backbone). The choice of backbone depends on the volume and complexity of the data: you can consider everything from ResNet18 to a Visual Transformer.

The first distinctive feature of Image Retrieval models is the magic in the head of the neural network. On the Image Retrieval leaderboard, teams compete to build the best descriptors: there are Combined Global Descriptors with parallel poolings, and Batch Drop Block for a more even distribution of activations over the output feature map.

The second main feature is the loss functions. There are a lot of them. The survey Deep Image Retrieval: A Survey alone recommends over a dozen pairwise losses, and there are about as many classification losses. The main point of all these losses is to train the neural network to map an image into a vector of a linearly separable space, so that these vectors can then be compared by cosine or Euclidean distance: similar images will have close embeddings, and dissimilar images will have distant ones.
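The three-step pipeline described above (train, index, search) can be sketched in a few lines of Python. This is a minimal illustration, not production code; `model` stands for any function that maps an image to a vector and is purely hypothetical here.

```python
import numpy as np

def embed(image, model):
    # Run the model on one image and L2-normalize the D-dimensional embedding
    v = np.asarray(model(image), dtype=float)
    return v / np.linalg.norm(v)

def build_index(images, model):
    # Step 2: run the trained model over the whole image base and stack the embeddings
    return np.stack([embed(img, model) for img in images])

def search(query_image, index, model, top_k=5):
    # Step 3: embed the query and rank the base by cosine similarity
    q = embed(query_image, model)
    scores = index @ q          # dot product == cosine similarity for unit vectors
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```

In a real service, the loop in `build_index` would be a batched forward pass on a GPU, and the linear scan in `search` would be replaced by an approximate index once the base grows large.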
Let's take a closer look.

Loss Functions

Contrastive Loss

This is a pairwise loss, i.e., objects are compared by the distance between them. The neural network is penalized for the distance between the embeddings of images p and q if these images are actually similar. Similarly, it is penalized for the proximity of embeddings whose images are actually different from each other. In the latter case, we set a margin m (for example, 0.5); once the distance exceeds it, we consider that the network has coped with the task of separating the dissimilar images.

Triplet Loss

Here we aim to minimize the distance from the anchor to the positive and maximize the distance from the anchor to the negative. Triplet Loss was first introduced in Google's FaceNet paper on face recognition and was long the state-of-the-art solution.

N-tupled Loss

N-tupled Loss is a development of Triplet Loss that also takes an anchor and a positive, but uses multiple negatives instead of a single one.

Angular Additive Margin (ArcFace)

The problem with pairwise losses lies in choosing the combinations of positives, negatives, and anchors: if they are sampled uniformly at random from the dataset, the problem of "easy pairs" arises. These are pairs of images so simple that the loss on them is 0. It turns out that the network converges quickly enough to a state in which most elements in the batch are "easy" for it and the loss on them is zero, and the network stops learning. To avoid this problem, sophisticated pair-mining techniques were invented: hard negative and hard positive mining. You can read more about the problem in this article. There is also the PML library, which implements many mining methods and in general contains a lot of useful material on the Metric Learning task in PyTorch.

Another solution to the problem is classification losses.
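Before moving on, the pairwise losses above can be written out in a few lines of NumPy. This is a sketch for a single pair or triplet; real implementations operate on batches and add the mining strategies discussed above.

```python
import numpy as np

def contrastive_loss(p, q, similar, m=0.5):
    # Pairwise loss: pull similar embeddings together; push dissimilar ones
    # apart until they are at least the margin m away from each other
    d = np.linalg.norm(p - q)
    if similar:
        return d ** 2
    return max(0.0, m - d) ** 2

def triplet_loss(anchor, positive, negative, m=0.5):
    # Penalize triplets where the negative is not at least m farther
    # from the anchor than the positive is
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + m)
```

Note that an "easy" triplet (the negative already much farther than the positive) yields exactly zero loss and zero gradient, which is precisely the convergence problem that hard mining addresses.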
Consider one popular loss function that led to state-of-the-art face recognition three years ago: ArcFace.

The main idea is to add an angular margin m to the usual cross-entropy, which concentrates the embeddings of images of one class around the centroid of that class so that they are all separated from the embedding clusters of other classes by at least the angle m. This seems like the perfect loss function, especially when you look at the MegaFace benchmark, but you need to keep in mind that it works only if you have classification labels. If you don't, you will have to work with pairwise losses.

Here I show visually which loss functions are best used when you have single-class and multi-class labels (from the latter you can derive pairwise labels by counting the percentage of intersection between the multilabel vectors of examples).

Pooling

Let's go back to the neural network architecture and consider a couple of pooling layers used in Image Retrieval tasks.

Regional Maximum Activation of Convolutions (R-MAC)

R-MAC is a pooling layer that takes the output feature map of the network (before the global pooling or classification layers) and returns a descriptor vector computed as the sum of activations over different windows of the output map, where the activation of a window is the maximum over that window, taken for each channel independently. The resulting descriptor takes into account local features of the image at different scales, creating a rich feature description. This descriptor can itself serve as the embedding and be sent directly to the loss function.

Generalized Mean (GeM)

Generalized Mean (GeM) is a simple pooling that can improve the quality of the output descriptor. The idea is that classical average pooling can be generalized to the lambda-norm. By increasing lambda, we make the network focus on the significant parts of the image, which can be important in certain tasks.
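GeM is easy to express directly from its definition. Below is a minimal NumPy sketch for a single (C, H, W) feature map; the lambda from the text is the exponent p here, and in practice it is a learnable parameter of a PyTorch layer rather than a fixed constant.

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    # Generalized Mean pooling over a (C, H, W) feature map:
    # p = 1 recovers average pooling, p -> infinity approaches max pooling
    x = np.clip(feature_map, eps, None)   # activations assumed non-negative (post-ReLU)
    return (x ** p).mean(axis=(1, 2)) ** (1.0 / p)
```

Raising p shifts the pooled value toward the largest activations in each channel, which is how GeM makes the descriptor focus on the most salient regions.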
Ranking

The key to a high-quality similar-image search is ranking, i.e., displaying the most relevant examples for a given query. An index of descriptors is characterized by its build speed, its search speed, and the memory it consumes.

The simplest approach is to keep the embeddings as-is and do a brute-force search over them, for example using cosine distance. Problems appear when there are many embeddings: millions, tens of millions, or even more. The search speed drops significantly and the memory occupied grows. One positive thing remains: with the embeddings stored exactly, the search quality is perfect.

These problems can be solved at the expense of quality: store the embeddings not in their original form but compressed (quantized), and change the search strategy so that, instead of brute force, the required number of nearest neighbors is found with the minimum number of comparisons. There are many efficient frameworks for approximate nearest neighbor search. A dedicated benchmark has been created for them, where you can see how each library behaves on different datasets. The most popular are NMSLIB, Spotify Annoy, Facebook Faiss, and Google ScaNN. Also, if you want indexing with a REST API out of the box, you can consider the Jina application.

Re-ranking

Researchers in the Information Retrieval field realized long ago that ordered search results can be improved by somehow reordering the items after the original results have been received.

Query Expansion

One such method is Query Expansion. The idea is to use the top-k closest elements to generate a new embedding. In the simplest case, you can take the averaged vector, as shown in the picture above. You can also weight the embeddings, for example by rank in the results or by cosine distance to the query. Such improvements are described within a single framework in the article Attention-Based Query Expansion Learning.
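Both the brute-force baseline and the simplest (averaged) form of Query Expansion fit in a few lines. This sketch assumes all embeddings are L2-normalized, so the dot product equals cosine similarity.

```python
import numpy as np

def brute_force_search(query, index, top_k=10):
    # Exact search: score the query against every embedding in the index
    scores = index @ query
    return np.argsort(-scores)[:top_k]

def average_query_expansion(query, index, top_k=5):
    # Build a new query embedding as the average of the original query
    # and its top-k neighbours, then renormalize back to the unit sphere
    top = brute_force_search(query, index, top_k)
    expanded = np.vstack([query, index[top]]).mean(axis=0)
    return expanded / np.linalg.norm(expanded)
```

Searching again with the expanded vector is then expected to pull in relevant items that the original query embedding missed.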
Optionally, you can apply Query Expansion recursively.

k-reciprocal

The k-reciprocal set is the subset of the top-k elements whose own k nearest neighbors include the query itself. On the basis of this set, a re-ranking process is built, one variant of which is described in the article Re-ranking Person Re-identification with k-reciprocal Encoding. By definition, k-reciprocal elements are closer to the query than arbitrary k-nearest neighbors, so the elements of the k-reciprocal set can roughly be treated as known positives, and the weighting rule can be changed accordingly, for example, in Query Expansion. The article develops a mechanism for recalculating distances using the k-reciprocal sets of the top-k elements themselves. It contains a lot of derivations that are beyond the scope of this post, so I suggest interested readers study it on their own.

Validation

We come to checking the quality of the similarity search. There are many subtleties in this task that beginners may not notice when they first start working on an Image Retrieval project.

Metrics

Let's consider the following popular metrics in the Image Retrieval task: precision@k, recall@k, R-precision, mAP, and nDCG.

precision@k

Pros:
- shows the percentage of top-k responses that are relevant

Cons:
- very sensitive to the number of relevant examples for a given query, which prevents an objective assessment of search quality when different queries have different numbers of relevant examples
- reaching the value 1 is possible only if the number of relevant examples is >= k for every query

R-precision

The same as precision@k, with k set equal to the number of examples relevant to the query.
Pros:
- the sensitivity to the number k in precision@k disappears, and the metric becomes stable

Cons:
- you have to know the total number of examples relevant to the query (which can be a problem if not all relevant ones are labeled)

recall@k

Shows what proportion of relevant items were found in the top-k.

Pros:
- answers the question of whether relevant items are found among the top-k at all
- stable and averages well over queries

mAP (mean Average Precision)

Shows how densely we fill the top of the search results with relevant examples. You can look at it as the amount of information received by a user of the search engine who reads the fewest pages: the greater the amount of information per page read, the higher the metric.

Pros:
- an objective and stable assessment of search quality
- a single-number representation of the precision-recall curve, which is itself rich in information for analysis

Cons:
- you have to know the total number of examples relevant to the query (which can be a problem if not all relevant ones are labeled)

You can read more about metrics in Information Retrieval, including the derivation of mAP, here.

nDCG (Normalized Discounted Cumulative Gain)

This metric shows how correctly the elements in the top-k are ordered among themselves. We will not consider its pros and cons, since it is the only metric in our list that takes the order of elements into account. Still, there are studies showing that it is fairly stable when order must be taken into account and is suitable in most cases.
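The listed set-based metrics are straightforward to implement. A sketch in plain Python, where `relevant` is the collection of labeled relevant items for one query and `ranked` is the search output; mAP is `average_precision` averaged over all queries.

```python
def precision_at_k(relevant, ranked, k):
    # Fraction of the top-k results that are relevant
    return len(set(ranked[:k]) & set(relevant)) / k

def recall_at_k(relevant, ranked, k):
    # Fraction of all relevant items that appear in the top-k
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def average_precision(relevant, ranked):
    # Average of precision@i over the ranks i where a relevant item appears;
    # requires knowing the full number of relevant items for the query
    relevant = set(relevant)
    hits, ap = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            ap += hits / i
    return ap / len(relevant)
```

The denominator `len(relevant)` in recall and average precision is exactly the "total number of relevant examples" whose absence from the markup makes these metrics problematic.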
Averaging over queries

A metric is calculated for each query and then aggregated over all queries in one of two ways.

macro: the metric is calculated for each query, then averaged over all queries
- pros: no significant fluctuations due to different numbers of relevant examples per query
- cons: all queries are treated as equal, even if some have many more relevant examples than others

micro: the number of labeled relevant items and, separately, the number of successfully retrieved relevant items are summed over all queries, then plugged into the fraction of the corresponding metric
- pros: queries are evaluated taking into account the number of labeled relevant items for each of them
- cons: the metric can become very low / very high if some query had a lot of labeled relevant items and the system unsuccessfully / successfully brought them to the top

Validation schemes

I propose to consider two options for validation.

Validation on a set of queries and selected relevant items

Input: image queries and the images relevant to them, with markup in the form of a list of items relevant to a given query. To calculate the metrics, you can compute the all-pairs relevance matrix and, based on the binary relevance information, calculate the metrics.

Full-base validation

Input: image queries and the images relevant to them. There should also be a validation base of images in which, ideally, all items relevant to the queries are labeled. It should not contain the query images themselves; otherwise they will have to be filtered out at the search stage so that they do not clog the top-1. The validation base serves as a base of negatives: our task is to pull the relevant items out of it. To calculate the metrics, you can go through all the queries, calculate the distances to all elements, including the relevant ones, and pass them to the metric calculation function.

Completed project example

Companies sometimes dispute other companies' use of the pictorial elements of their brands.
In such cases, a weaker manufacturer tries to piggyback on the success of a large brand by presenting its products and services under similar symbols. Users suffer from this too: you can mistakenly buy cheese from the wrong manufacturer, trusting a familiar brand but taking a counterfeit product without reading the label carefully. A recent example: Apple Is Attempting to Block a Pear Logo Trademark.

There are special companies, governmental and private, that combat the illegal use of trademarks. They maintain a register of registered trademarks against which new incoming marks can be compared, in order to approve or reject an application for trademark registration. Above is an example of the interface of the WIPO system. In such systems, a similar-image search is a good helper: with it, an expert can find analogs faster.

Examples of how our system works

The size of the indexed image base: several million trademarks. Here the first image is the query, the next line shows the list of expected relevant marks, and the remaining lines show what the search engine returns in order of decreasing relevance.

Conclusion

That's all. This was an overview. I hope those who are building or about to build similar-image search engines have gained some benefit. And conversely, if you think I am wrong somewhere, tell me in the comments; I will be glad to receive feedback.