738 reads

To be Relevant or not to be: a Search Story about Precision and Recall

by Bruno MarquiéOctober 5th, 2020

Too Long; Didn't Read

Elasticsearch and Solr are the most popular and widely used full-text search engines. They are both built on top of Apache Lucene, a high-performance, full-featured text search engine library written in Java. A document is what is stored by the search engine, which is nothing more than a set of properties (named fields) that describe an object. In the next part, we will focus mainly on this last aspect, defining the notion of relevancy, how to measure it, and how to improve it.

Companies Mentioned

featured image - To be Relevant or not to be: a Search Story about Precision and Recall

With the amount of data created growing exponentially each year and forecasted to reach 59 zettabytes in 2020 and more than 175 zettabytes by 2025, the importance of discovering and understanding this data will continue to be, even more than before, a decisive and competitive differentiator for many companies.

In this context, it’s not a surprise that search is a central feature for many content-oriented applications.

Building a search engine (or providing search capabilities) requires to consider multiple aspects:

the heterogeneous nature of the content: different sources of data, content types, or languages. As always the quality of the stored data is key and the returned results can only be as good as them. This is a well-known postulate for any ML application.

performances: returning search results fast is considered a given for many end-users now used to Google. Improving and constantly measuring the freshness of the data (reducing the time between the moment the content has been created/changed and the moment that it is searchable) also needs to be addressed. Trying to balance both brings its share of challenges.

scalability: with this amount of data generated each day, and an always growing userbase, this last aspect is central too.

Returning the information quickly to any end-user is important. However, speed and performance are not enough. Returning the right information and being relevant is just as important. The quality of the returned results makes the difference between a productive and a frustrated end-user.

In the next parts, we will focus mainly on this last aspect, defining the notion of relevancy, how to measure it, and how to improve it.

What’s going on under the hood?

Before deep-diving into the concept of relevancy, some understanding of how a search engine works is needed.

In this post, we focus on what we name a full-text search engine. Elasticsearch and Solr are the most popular and widely used full-text search engines. They are very similar in term of features as they are both built on top of Lucene:

Apache LuceneTM is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Even if not 100% accurate, a full-text search engine analyzes, stores text content and helps find this content by search by keyword. It’s ideally used to store and query unstructured content, whereas a SQL DB is used to store and query structured content. Now although Lucene is specialized in analyzing unstructured text content, elasticsearch and solr provide an abstraction that not only supports structured text content but also non-text content (date, integer …)

A document is what is stored by the search engine, which is nothing more than a set of properties (named fields) that describe an object: to make a parallel with a database, if we wanted to store a DB table in elasticsearch, we could create one document per record with one field per column.

Elasticsearch’s documents are represented using json.

The following examples are 2 documents, each one representing a book (i.e. the DB record or the object). Each document has 2 fields (DB columns or the object’s properties) “title” and “description”, each one containing raw text. Note that usually, we would store the content of the book as an additional field that could be named “content”. For the sake of clarity, we kept these examples concise.

As a side note, we often hear a difference being made between metadata and content. This is misleading, everything is a field, the “main” content can be stored in a different text field than the other properties (this metadata). But the end goal is to search for “documents” having one or several “fields” containing some keywords: e.g. “Find all the documents with ‘ramen’ in the “title”

In the domain of Information Retrieval, the set of documents is referred to as “corpus”.“ Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). “

— Introduction to Information Retrieval, Cambridge University Press. 2008.

Now that we understand that the main task of a full-text search engine is to store documents and find them using keywords, let’s correct something: the search engine does not store documents (most of the time) but indexes.

Documents are used to build an index to speed up query execution. Instead of the regular b-tree indices used by standard databases to match records, a full-text search engine builds a specific index to match the right documents at query time: an inverted index. Each key of the inverted index is a term (word of a text field) that references the list of documents that contains it. For a given document, we have one inverted index per field.

This index is called “inverted” because the keys are the terms and they reference a list of documents (named the posting list) and not the opposite.

One other major advantage of building this index is also to help infer what is named term vectors. This time there is one term vector per document per field, and for each term in the field it gives some statistics that can be used to calculate later how relevant the document is at query time, like:

how often this term occurs in the document (Term Frequency)

and the number of documents containing the current term (Document Frequency).

Here is one part of the term vector returned by elasticsearch for the field title of the first document.

We will look again later at TF and DF, but now we have enough bases to start talking about relevancy.

What is Search Relevance?

As already mentioned above, returning results fast is not enough if they are not relevant.

In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user.

— Wikipedia

The notion of what is relevant can differ per user, and for this reason, there is not a unique way to define it. But we can isolate a set of expectations and reasons that show why returning relevant results to the end-user is important.

The results returned to the end-user should:

help solve their problem

help them to be more productive and not require them to spend more time searching

match what they were looking for and understand why it was returned.

Returning the right and most relevant information makes the difference between a loyal end-user who adopts the application and comes back for more and a frustrated end-user who ultimately gives up.

Making the end-user happy is our number one priority.

If you can’t measure it, you can’t improve it.
— Peter Drucker

In the domain of IR, Precision and Recall are two main metrics often used to measure the search relevance.

Although precision and recall can focus on the top results, they do not capture one main aspect of the search behavior: the user is not only interested in finding any relevant documents, they want the most relevant ones. Relevancy is not binary. Some things are more relevant than others. The goal of a search engine is to find and rank the results.

Measuring the quality of the ranking can be done using a different set of metrics, like Average Precision@K or Mean Reciprocal Rank(MRR) or Normalized Discounted Cumulative Gain (NDCG). These rank-based metrics weigh the relevance of each result based on its position. The last one has the advantage of capturing in which position the results are supposed to be.

Finally, there is a set of signals that also help assess the quality of the user search experience and that we aim to optimize:

number of zero result queries: indicates a recall issue,

0 click queries indicate a precision issue,

number of clicks per search cycle: indicate how fast the user reach the right info

most popular queries

average rank for most popular queries

Why and how to improve relevancy?

At this point, we understand that relevancy can be tracked and measured by several metrics. There are different approaches to try improving them.

Let’s look first at precision and recall. As already mentioned before, these 2 work hand in hand and balance each other. Depending on the searcher, the balance to find can be different:

improving recall will help exploration and the users that do not know what they want: the users who are in discovery mode, do not have a good understanding of the available documents and do not know what they are searching for. As the recall improves, the precision decreases.
improving precision will help the users who want to find a document that they know is present: they know what this document is about and maybe already accessed it before. They have one goal to complete and want to do it fast.

So, the one that we should prioritize depends on the searcher and how well we can understand and guide them. The sweet spot is different for each user, building a one fits all solution is difficult, being specific brings the risk of overfitting.

Neither option is better and to complicate things, user’s tastes and goals can change. So there is no silver bullet, understanding the user and the domain is key. Giving them more control over the query that is built (through facets or query suggestions for example), will help them get access to what is relevant for them.

However, there exist different ways to improve each one of them:

the precision can be improved by adding more specific terms to the query, introducing some curated and highly specific fields to the documents, and scoping more specifically the query to some fields, leveraging different tokenization strategies, using phrase search or proximity search…

the recall increases when the size of the matching result set grows, meaning that the query matches more results: using query rewriting mechanisms (adding more generic query terms or removing specialized query terms), but also synonyms or stemming.

As already mentioned, these 2 metrics do not reflect the ranking aspects of search. Computing a score allows us to compare documents’ relevancies together and ultimately order them. When trying to improve precision, we mentioned that we could specifically target some documents’ fields. Deciding how much importance a field has compared to another will help come up with a better score.

When a document matches a query, a score is computed by the search engine. One big part of this score is what we name TF/IDF that starts from the idea that the more times a document contains a term, the most likely it is to be about that term.

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.[1] It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general

— Wikipedia

There are different ways to compute TFIDF and as we mentioned earlier, TF and DF statistics are computed at indexing time and can be retrieved by looking at the terms vectors.

We end up having a TFIDF score per field per query term that we can weight differently as part of the overall score. Some other criteria can also impact this final score, like for example the freshness of the document or some external popularity score.

Here is a simple way to tune how the search engine computes the overall score: the first query does not attempt to change anything and just relies on the standard oob score computation algorithm.

The second one gives 2 times more importance when the query matches the title to change the ranking order.

Testing & Tuning Search

The first approach to test all these possible improvements and models is to rely on a test set or more precisely what we call a (human) judgment list: this list records which documents should match which queries and establish the ideal ordering. Building and maintaining such a list can be expensive and requires domain expertise, but this is the most accurate way to measure how precision and recall improves.

Another approach is to rely on the users’ actions and use these behaviors (clicks, more specific actions like comment or print, or conversions) as implicit feedback. By using this feedback, and weighting them differently, we can rebuild an implicit judgment list. They are less accurate than an explicit one, but easier to generate and fine-tune.

The different metrics listed above help to measure how much progress is made and helps us make educated decisions.

Improving search relevancy is a challenging process considering:

the different users’ behaviors and goals,

the need to balance between precision and recall,

the scoring function,

the bias coming from the UI,

the different times it can be impacted: query time (query expansion), analyzing and indexing (tokenization, enrichment, stemming…),

the dependency on the domain (synonyms, ….)

Crafting an optimal scoring function requires experience and can easily become a whack-a-mole game if all the infrastructure is not present to offline test and prevent regressions.

As well as testing, coming up with different options is sometimes a long journey of trial and error:

what query terms are important?

should we prefer popular results?

how much popular?

how much recent?

This phase can also be addressed by machine learning: coming up with the right model, features and values can be part of a Learning to Rank approach

Learning to rank[1] or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems

— Wikipedia

It offers a way to optimize constantly the relevance of the results over time using realtime usage by improving gradually the precision and recall metrics.

Baby steps are the key to success

As with any software engineering application, this is all about incremental delivery and building foundations.

Some lessons learned from the trenches, building a highly relevant search experience often requires to:

put in place a flexible and pluggable pipeline that allows customizing the way data are cleaned and indexed,

address first 50%-80% of the queries’ needs instead of shooting for the last 20%,

make data-driven decisions quickly and having the right testing and monitoring infrastructure in place,

manually tuning first, in order to be responsive to keep the end-user engaged

understand the searcher needs and intent,

help them navigate through the pitfalls, figure out what’s going on, and always give them an option to bypass the system, craft their query to find what they were looking for,

record, improve, test offline

and finally, look at this sweet machine learning approach.

To keep and ultimately grow the userbase, improving relevancy is on par with performances. The more relevant the search results are and the faster the user finds what they want, the less they will have to interact with it, and so the less overloaded and more performant the system will be.