With the amount of data created growing exponentially each year, forecast to reach 59 zettabytes in 2020 and more than 175 zettabytes by 2025, the importance of discovering and understanding this data will continue, even more than before, to be a decisive competitive differentiator for many companies.
In this context, it’s not a surprise that search is a central feature for many content-oriented applications.
Building a search engine (or providing search capabilities) requires balancing multiple aspects:
Returning the information quickly to any end-user is important. However, speed and performance are not enough. Returning the right information and being relevant is just as important. The quality of the returned results makes the difference between a productive and a frustrated end-user.
In the next parts, we will focus mainly on this last aspect, defining the notion of relevancy, how to measure it, and how to improve it.
Before deep-diving into the concept of relevancy, some understanding of how a search engine works is needed.
In this post, we focus on what we call a full-text search engine. Elasticsearch and Solr are the most popular and widely used full-text search engines. They are very similar in terms of features, as they are both built on top of Lucene:
Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Even if this is not 100% accurate, a full-text search engine analyzes and stores text content and helps find this content by searching by keyword. It is ideally used to store and query unstructured content, whereas a SQL database is used to store and query structured content. Although Lucene specializes in analyzing unstructured text content, Elasticsearch and Solr provide an abstraction that supports not only structured text content but also non-text content (date, integer, …).
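To make this concrete, here is how a hypothetical index of books could mix text and non-text fields in a recent Elasticsearch version; the index name and the non-text fields ("published_on", "page_count") are only illustrative assumptions for this post:

PUT /books
{
  "mappings": {
    "properties": {
      "title":        { "type": "text" },
      "description":  { "type": "text" },
      "published_on": { "type": "date" },
      "page_count":   { "type": "integer" }
    }
  }
}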
A document is what the search engine stores; it is nothing more than a set of properties (named fields) that describe an object. To make a parallel with a database, if we wanted to store a DB table in Elasticsearch, we could create one document per record, with one field per column.
Elasticsearch’s documents are represented using JSON.
The following examples are 2 documents, each one representing a book (i.e. the DB record or the object). Each document has 2 fields (DB columns or the object’s properties) “title” and “description”, each one containing raw text. Note that usually, we would store the content of the book as an additional field that could be named “content”. For the sake of clarity, we kept these examples concise.
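For illustration (the books themselves are made up for this example), the two documents could look like this:

{
  "title": "The Art of Ramen",
  "description": "A cookbook exploring the history and recipes of Japanese ramen."
}

{
  "title": "Sushi Made Simple",
  "description": "A beginner's guide to preparing sushi at home."
}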
As a side note, we often hear a distinction being made between metadata and content. This is misleading: everything is a field, and the “main” content is simply stored in a different text field than the other properties (the metadata). The end goal is to search for documents having one or several fields containing some keywords, e.g. “find all the documents with ‘ramen’ in the ‘title’”.
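In Elasticsearch’s query DSL, that request translates into something like the following match query (assuming the documents are stored in an index named "books"):

GET /books/_search
{
  "query": {
    "match": { "title": "ramen" }
  }
}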
In the domain of Information Retrieval, the set of documents is referred to as the “corpus”.
“Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).”
— Introduction to Information Retrieval, Cambridge University Press. 2008.
Now that we understand that the main task of a full-text search engine is to store documents and find them using keywords, let’s correct something: most of the time, the search engine does not store the documents themselves but indexes built from them.
Documents are used to build an index that speeds up query execution. Instead of the regular B-tree indices used by standard databases to match records, a full-text search engine builds a specific index to match the right documents at query time: an inverted index. Each key of the inverted index is a term (a word of a text field) that references the list of documents containing it. There is one inverted index per field.
This index is called “inverted” because the keys are the terms and they reference a list of documents (named the posting list) and not the opposite.
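As a rough sketch, assuming the two example books are indexed with ids 1 and 2 and ignoring how the text is tokenized and normalized, the inverted index for the "title" field would look something like this, each key being a term and each value the posting list of document ids:

{
  "art":    [1],
  "made":   [2],
  "of":     [1],
  "ramen":  [1],
  "simple": [2],
  "sushi":  [2],
  "the":    [1]
}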
Another major advantage of building this index is that it helps compute what are called term vectors. This time, there is one term vector per document per field, and for each term in the field it gives statistics that can later be used at query time to calculate how relevant the document is, like the term frequency (TF: how many times the term appears in this field) and the document frequency (DF: how many documents in the corpus contain this term).
Here is one part of the term vector returned by Elasticsearch for the field “title” of the first document.
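A trimmed, illustrative response could look like the following; the shape follows the _termvectors API, doc_freq is only included when term statistics are requested, and the exact numbers depend on the corpus:

{
  "term_vectors": {
    "title": {
      "terms": {
        "ramen": {
          "term_freq": 1,
          "doc_freq": 1
        }
      }
    }
  }
}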
We will come back to TF and DF later, but we now have enough background to start talking about relevancy.
As already mentioned above, returning results fast is not enough if they are not relevant.
In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user.
— Wikipedia
The notion of what is relevant can differ per user, and for this reason, there is no single way to define it. But we can isolate a set of expectations and reasons that show why returning relevant results to the end-user is important.
The results returned to the end-user should satisfy their information need and meet their expectations.
Returning the right and most relevant information makes the difference between a loyal end-user who adopts the application and comes back for more and a frustrated end-user who ultimately gives up.
Making the end-user happy is our number one priority.
In the domain of IR, precision and recall are two main metrics often used to measure search relevance: precision is the fraction of the returned results that are relevant, while recall is the fraction of all the relevant documents in the corpus that are actually returned.
Although precision and recall can focus on the top results, they do not capture one main aspect of search behavior: the user is not only interested in finding any relevant documents; they want the most relevant ones. Relevancy is not binary: some things are more relevant than others. The goal of a search engine is to find and rank the results.
Measuring the quality of the ranking can be done using a different set of metrics, like Average Precision@K, Mean Reciprocal Rank (MRR), or Normalized Discounted Cumulative Gain (NDCG). These rank-based metrics weigh the relevance of each result based on its position. The last one has the advantage of also capturing in which position each result is supposed to appear.
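As a reminder of the textbook formulation (nothing here is specific to a particular search engine), DCG at a cutoff k discounts the relevance grade rel_i of the result at position i, and NDCG normalizes it by the DCG of the ideal ordering (IDCG):

DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}

NDCG@k = \frac{DCG@k}{IDCG@k}

A relevant result pushed down the list contributes less because of the logarithmic discount, which is exactly the positional aspect that precision and recall miss.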
Finally, there is a set of signals, like clicks and conversions, that also helps assess the quality of the user search experience and that we aim to optimize.
At this point, we understand that relevancy can be tracked and measured by several metrics. There are different approaches to try improving them.
Let’s look first at precision and recall. As already mentioned, these two work hand in hand and balance each other: improving one often degrades the other. Depending on the searcher, the balance to find can be different: a researcher doing an exhaustive patent or legal search will favor recall, since missing a relevant document is costly, while a shopper looking for one item mostly cares about the precision of the first few results.
So, the one that we should prioritize depends on the searcher and on how well we can understand and guide them. The sweet spot is different for each user: building a one-size-fits-all solution is difficult, while being too specific brings the risk of overfitting.
Neither option is better, and to complicate things, users’ tastes and goals can change. So there is no silver bullet: understanding the user and the domain is key. Giving users more control over the query that is built (through facets or query suggestions, for example) will help them get access to what is relevant for them.
However, there exist different ways to improve each one of them: for example, requiring the query terms to match specific fields, or to all match, tends to improve precision, while matching terms more loosely tends to improve recall (see the sketch below).
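As a small sketch of the precision side, Elasticsearch’s match query accepts an "operator" parameter: requiring all the query terms to match ("and") typically returns fewer but more precise results than the default "or". The query text here is, again, only an example:

GET /books/_search
{
  "query": {
    "match": {
      "title": {
        "query": "spicy ramen recipes",
        "operator": "and"
      }
    }
  }
}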
As already mentioned, these two metrics do not reflect the ranking aspect of search. Computing a score allows us to compare documents’ relevance and ultimately order them. When trying to improve precision, we mentioned that we could specifically target some of the documents’ fields. Deciding how much importance a field has compared to another will help come up with a better score.
When a document matches a query, a score is computed by the search engine. One big part of this score is what we call TF/IDF, which starts from the idea that the more times a document contains a term, the more likely it is to be about that term.
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.[1] It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general
— Wikipedia
There are different ways to compute TF/IDF, and as we mentioned earlier, the TF and DF statistics are computed at indexing time and can be retrieved by looking at the term vectors.
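As a reference, the textbook formulation looks like this, where tf(t, d) is the number of times term t appears in document d, N the number of documents in the corpus, and df(t) the number of documents containing t; Lucene’s actual similarity functions (classic TF/IDF, or BM25 in recent Elasticsearch versions) use refined variants of the same idea:

tfidf(t, d) = tf(t, d) \times idf(t)

idf(t) = \log \frac{N}{df(t)}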
We end up with a TF/IDF score per field per query term that we can weight differently as part of the overall score. Some other criteria can also impact this final score, like the freshness of the document or some external popularity score.
Here is a simple way to tune how the search engine computes the overall score: the first query does not attempt to change anything and just relies on the standard out-of-the-box score computation algorithm.
The second one gives twice as much importance to matches on the title, in order to change the ranking order.
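For example, with a multi_match query over our two illustrative fields, the extra weight can be expressed with the ^ suffix on the field name; the index, fields, and query text are the ones assumed in the earlier examples:

GET /books/_search
{
  "query": {
    "multi_match": {
      "query": "ramen",
      "fields": ["title", "description"]
    }
  }
}

GET /books/_search
{
  "query": {
    "multi_match": {
      "query": "ramen",
      "fields": ["title^2", "description"]
    }
  }
}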
The first approach to test all these possible improvements and models is to rely on a test set, or more precisely what we call a (human) judgment list: this list records which documents should match which queries and establishes the ideal ordering. Building and maintaining such a list can be expensive and requires domain expertise, but this is the most accurate way to measure how precision and recall improve.
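There is no single standard format for such a list; a minimal, made-up example could simply record a relevance grade per (query, document) pair, with higher grades meaning more relevant:

[
  { "query": "ramen", "doc_id": "1", "grade": 3 },
  { "query": "ramen", "doc_id": "2", "grade": 0 },
  { "query": "sushi", "doc_id": "2", "grade": 3 }
]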
Another approach is to rely on the users’ actions and use these behaviors (clicks, more specific actions like commenting or printing, or conversions) as implicit feedback. By using this feedback and weighting each type of action differently, we can rebuild an implicit judgment list. It is less accurate than an explicit one, but easier to generate and fine-tune.
The different metrics listed above help measure how much progress is made and help us make educated decisions.
Improving search relevancy is a challenging process, for several reasons.
Crafting an optimal scoring function requires experience and can easily become a game of whack-a-mole if the infrastructure to test offline and prevent regressions is not in place.
As well as testing, coming up with different options is sometimes a long journey of trial and error.
This phase can also be addressed by machine learning: coming up with the right model, features, and values can be part of a Learning to Rank approach.
Learning to rank[1] or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems
— Wikipedia
It offers a way to constantly optimize the relevance of the results over time, using real usage to gradually improve the precision and recall metrics.
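To make this a bit more concrete, a learning-to-rank model is typically trained on examples that combine, for each (query, document) pair, a set of features and a relevance label (explicit, or derived from the implicit feedback described above). The feature names and values below are purely illustrative:

[
  {
    "query": "ramen",
    "doc_id": "1",
    "features": { "title_score": 2.4, "description_score": 1.1, "freshness_days": 30, "popularity": 0.8 },
    "label": 3
  },
  {
    "query": "ramen",
    "doc_id": "2",
    "features": { "title_score": 0.0, "description_score": 0.2, "freshness_days": 400, "popularity": 0.3 },
    "label": 0
  }
]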
As with any software engineering application, this is all about incremental delivery and building foundations.
Some lessons learned from the trenches: building a highly relevant search experience often requires understanding the users and the domain, measuring constantly, iterating in small increments, and building the infrastructure to test offline and prevent regressions.
To keep and ultimately grow the user base, improving relevancy is on par with improving performance. The more relevant the search results are, the faster the user finds what they want and the less they have to interact with the system, and so the less loaded and more performant it will be.
Found this article useful? Follow me on Hackernoon, Medium, Twitter, and Linkedin!