Recently I started using Elasticsearch, one of the best-known search engines. What it allows you to do fascinated me, and that’s why I dove into information retrieval (IR), the science behind it. It’s a huge topic, and there is no silver bullet for search engines: it is user needs and expectations that tailor the model and the process used to find information. This user-centric way of measuring the quality of an IR system captured me.
That’s why I decided to write about and share the core concepts of IR.
Speaking about finding data brings databases to mind. At first sight, DBMSs seem to overlap with IR systems: after all, a DBMS stores data and allows users to query it. Despite this similarity, they are different tools with different purposes.
The scope of a DBMS is broad. It implements transaction management, data constraints, data independence, and so on, and it supports both write-intensive and read-intensive workloads. The stored data can be structured (e.g., a table) or semi-structured (e.g., a JSON document). Nonetheless, in both cases we need to model the data by choosing the attributes or fields we need. This means that a query (or an insertion) depends upon this structure; in other words, a user needs to know the chosen structure before building a query. And answers to queries are of the kind: “these results exactly match the given query”. They’re precise.
The scope of IR systems is narrower, and they excel at read-mostly workloads. They focus on searching across a large collection of unstructured documents to satisfy an information need, where a document can be text or multimedia content, and an information need is something a user desires to delve into, expressed through a query. Terms (e.g., words) compose both documents and queries. The collection of unstructured documents is also known as a corpus.
Thanks to the data being unstructured, users can search across a collection of poems and a collection of medical reports in the same way. Indeed, the collection can be heterogeneous in the topics it covers; an obvious example is a collection of web pages. Given this heterogeneity, IR systems also need to rank documents and sort them according to their relevance, where a relevant document is one that satisfies the information need, and a more relevant one fulfils the search, and so the user, better. Answers to searches are of the kind: “these results seem to satisfy the information need”. They’re not precise. That’s why user satisfaction measures the quality of an IR system.
Generally, IR systems can also store and query semi-structured documents (e.g., poems with an author). But the focus stays on unstructured ones.
Two metrics measure the quality of an IR system: precision and recall.
Precision is the fraction of retrieved documents that are relevant to the information need. Higher precision is desirable. Nonetheless, we also need to consider how many relevant documents aren’t retrieved. That’s what recall measures.
Recall is the fraction of relevant documents successfully retrieved. To achieve maximum recall, an IR system could simply return every document in the collection. That’s why we need precision.
Indeed, precision and recall complement each other.
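To make these two metrics concrete, here’s a minimal Python sketch (the document ids and the numbers are made up for illustration):

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

# Suppose the corpus contains 4 relevant documents and the system
# returns 3 documents, 2 of which are relevant.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d9"}

print(precision(retrieved, relevant))  # 2/3 ≈ 0.67
print(recall(retrieved, relevant))     # 2/4 = 0.5
```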
Suppose an IR system over homogeneous and formal documents, where users state the information need rigorously.
A scenario of this kind is an IR system over exception logs. Each document is precise: it states the exception and the related stack trace. In such a system, a user can search for a NullPointerException at line 123 of AClassName or at line 42 of AnotherClassName.
The boolean model of information retrieval covers this scenario. It’s the oldest and simplest model. Nonetheless, it could be rather useful.
The model owes its name to the three boolean operators we can use to compose query terms: AND, OR and NOT. A single term makes up the most basic query, and a document is relevant if it satisfies the boolean query. This means a document is either relevant or not; there’s no ambiguity. In other words, all matching documents have the same relevance. That’s why this model is useful in a homogeneous and formal context.
As an example, we could express the previous information need as: NullPointerException AND (AClassName:123 OR AnotherClassName:42)
An IR system capable of interpreting this query will answer with all the exception logs satisfying it.
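Just to give an idea, here’s a tiny Python sketch of how such a boolean query could be evaluated (the log entries are invented, and each class/line pair is treated as a single term for simplicity):

```python
# Each log entry is reduced to the set of terms it contains,
# and the query is a boolean expression over those terms.
docs = {
    "log1": {"NullPointerException", "AClassName:123"},
    "log2": {"NullPointerException", "AnotherClassName:42"},
    "log3": {"IllegalStateException", "AClassName:123"},
}

def matches(terms):
    """NullPointerException AND (AClassName:123 OR AnotherClassName:42)"""
    return "NullPointerException" in terms and (
        "AClassName:123" in terms or "AnotherClassName:42" in terms
    )

print([doc_id for doc_id, terms in docs.items() if matches(terms)])  # ['log1', 'log2']
```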
To search across a collection of documents, we first need to store them, and we need to store them in a way that allows us to search them efficiently.
In the realm of full-text search, the most widely used data structure is the inverted index. It maps a term to the list of ids of the documents containing that term; this list is called an inverted list or posting list. Thanks to its structure, the inverted index allows us to efficiently retrieve the documents containing a given term. So simple, yet so powerful!
As an example, suppose two documents. The first one is: “Midway upon the journey of our life”. The second one is: “How heavy do I journey on the way”. The resulting inverted index is:
| Term | Inverted List |
|---|---|
| Midway | first |
| upon | first |
| the | first, second |
| journey | first, second |
| of | first |
| our | first |
| life | first |
| How | second |
| heavy | second |
| do | second |
| I | second |
| on | second |
| way | second |
A naive implementation of an inverted index could be based on a hash table. The process through which we save a document in the index is called indexing.
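Here’s a minimal sketch of such a hash-table-based index in Python, built over the two documents above (the function and variable names are mine; a real implementation would also normalise terms, compress the posting lists, and so on):

```python
from collections import defaultdict

def index(corpus):
    """Build an inverted index: term -> list of ids of the documents containing it."""
    inverted_index = defaultdict(list)
    for doc_id, text in corpus.items():
        for term in set(text.split()):  # each document contributes its id at most once per term
            inverted_index[term].append(doc_id)
    return inverted_index

corpus = {
    "first": "Midway upon the journey of our life",
    "second": "How heavy do I journey on the way",
}

idx = index(corpus)
print(idx["journey"])  # ['first', 'second']
print(idx["Midway"])   # ['first']
```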
Now suppose an IR system over heterogeneous documents, where users express the information need through the terms they’re interested in. Given the heterogeneity, the IR system needs to sort documents according to their relevance, showing the most relevant ones ahead of the least relevant. Here, the boolean model cannot help us: after all, it supports only a boolean relevance criterion.
An information need could be as simple as: Apache Kafka. This means the user would like to delve into the distributed event-streaming platform. An IR system that ranks documents about Kafka (the writer) or Apache (the Native Americans) higher will disappoint the user. That’s where the vector space model shines.
This model represents documents and queries as vectors, with one dimension per term. The value of each dimension is the weight of that term. Because the dot product measures the similarity between two vectors, we can use it to rank documents: the rank of a document, in relation to a query, is the result of the dot product between the two vectors in question.
As an example, the previous query could be Q = <1, 1>, where the dimensions are respectively Apache and Kafka. Given the same weights, the two terms are of the same importance. As you can imagine, users are responsible for choosing the weight of each term of the query; after all, they know what is more important to them. But the IR system is responsible for choosing the weight of each term of each document, and its success depends upon this assignment. That’s where ranking functions come to the rescue. They’re functions that map terms to weights.
The dot product definition at Wikipedia: https://en.wikipedia.org/wiki/Dot_product#Coordinate_definition
A naive ranking function maps each term present in the document to a weight of one, and each missing term to a weight of zero.
As an example, consider two documents: D1, which mentions Apache twice and Kafka once, and D2, which mentions Apache and Kafka just once each.
Now consider a user who would like to study Apache Kafka. They could search across the documents by issuing a query with the terms Apache, Kafka and features.
Both documents are related to Apache Kafka, but the first one is more specific, while the second one could be a social media post. So any user issuing this query will be happier if D1 has a higher rank. Unfortunately, this ranking function doesn’t meet this expectation.
Indeed, an IR system could model the vectors like:
Q = <1, 1, 1>
D1 = <1, 1, 0>
D2 = <1, 1, 0>
Where the first dimension is the term Apache, the second one is Kafka and the third one is features.
As we said, the first document is the most relevant. But, if we calculate the dot products, the two documents end up with the same rank (i.e., 2).
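We can check this with a few lines of Python, representing each vector as a term-to-weight dictionary (a sketch, not how a real engine stores vectors):

```python
def dot(query, doc):
    """Rank of a document with respect to a query: the dot product of the two vectors."""
    return sum(weight * doc.get(term, 0) for term, weight in query.items())

# Binary weights: 1 if the term occurs in the document, 0 otherwise.
q  = {"Apache": 1, "Kafka": 1, "features": 1}
d1 = {"Apache": 1, "Kafka": 1}
d2 = {"Apache": 1, "Kafka": 1}

print(dot(q, d1), dot(q, d2))  # 2 2 -> same rank, although D1 is the more specific document
```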
An issue with this ranking function is that it doesn’t take into account how frequent a term is. Intuitively, more occurrences of a term mean the document covers that topic better. We need to take this observation into account to build a better ranking function.
The vector space model with this ranking function resembles an “OR-only boolean model”: a document is relevant if it contains at least one query term. The difference is that here relevance increases when a document contains more query terms.
The TF (Term Frequency) function maps each term to its frequency in the document.
So, using this function, the previous vectors become:
Q = <1, 1, 1>
D1 = <2, 1, 0>
D2 = <1, 1, 0>
With these weights in place, the D1 rank (i.e., 3) is higher than the D2 rank (i.e., 2). It seems to match the user expectation!
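Reusing the same dot-product sketch, this time with TF weights:

```python
def dot(query, doc):
    """Rank of a document: the dot product of the query and document vectors."""
    return sum(weight * doc.get(term, 0) for term, weight in query.items())

# Term-frequency weights: each term is weighted by its number of occurrences.
q  = {"Apache": 1, "Kafka": 1, "features": 1}
d1 = {"Apache": 2, "Kafka": 1}   # Apache appears twice in D1
d2 = {"Apache": 1, "Kafka": 1}

print(dot(q, d1), dot(q, d2))  # 3 2 -> D1 now ranks above D2
```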
However, this function misses another important characteristic. Let’s examine it with another example.
Suppose the corpus now also contains a third document, D3, which mentions Kafka once and compaction once, and the user issues a query with the terms Apache, Kafka and compaction.
The user will be happier to see the D3 document first.
Using the TF function, the related vectors are:
D1 = <2, 1, 0>
D3 = <0, 1, 1>
Q = <1, 1, 1>
Where the first dimension is Apache, the second one is Kafka and the third one is compaction. Unfortunately, the D1 rank (i.e., 3) is higher than the D3 rank (i.e., 2). This means a disappointed user.
What the TF function misses is term rarity. In the previous corpus, the terms Apache and Kafka aren’t rare: the first one is mentioned two times in the D1 document, while the second one appears in both documents. This means they don’t add much value to the documents in relation to the information need. But the term compaction is mentioned only one time, in the D3 document. A rarer term characterises a document better, so we should tune the TF function according to this observation. In other words, we need to boost rarer terms.
The TF-IDF function maps each term to its weight by considering two factors: the first one is the aforesaid TF (Term Frequency), while the second one is the IDF (Inverse Document Frequency).
The IDF describes what we intuitively called rarity: we consider a term rarer if it appears in few documents of the entire corpus. IDF is defined as log(N/DF), where N is the number of documents in the corpus and DF (Document Frequency) is the number of documents in which the term under consideration appears. According to this definition, IDF tends to zero when DF approaches N.
Therefore, the TF-IDF function calculates the weight of a term by multiplying TF and IDF.
Let us repeat the previous query with all the documents considered until now:
Using this function the vectors are:
D1 = <0.35, 0, 0>
D2 = <0.17, 0, 0>
D3 = <0, 0, 0.47>
Q = <1, 1, 1>
Where the first dimension is Apache, the second one is Kafka and the third one is compaction.
Voilà! The D3 rank (i.e., 0.47) is higher than the D1 rank (0.35), and also higher than the D2 rank (0.17). As expected, the D3 document is the most relevant. Another expected result is that the social media post is the least relevant. It’s also worth noticing that the term Kafka isn’t considered relevant at all: this is because it appears in every document.
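As a quick check, a minimal Python sketch reproduces these weights; the logarithm base isn’t stated above, but a base-10 logarithm matches the figures:

```python
import math

def tf_idf(tf, df, n):
    """TF-IDF weight of a term: TF * log(N / DF), here with a base-10 logarithm."""
    return tf * math.log10(n / df)

N = 3  # the corpus now contains D1, D2 and D3

# Document frequencies in this corpus: Apache appears in D1 and D2,
# Kafka in all three documents, compaction only in D3.
print(tf_idf(2, 2, N))  # Apache in D1     -> ~0.352 (the 0.35 above)
print(tf_idf(1, 2, N))  # Apache in D2     -> ~0.176 (the 0.17 above)
print(tf_idf(1, 3, N))  # Kafka anywhere   -> 0.0, it appears in every document
print(tf_idf(1, 1, N))  # compaction in D3 -> ~0.477 (the 0.47 above)
```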
Actually, there are multiple ways to define TF, IDF and TF-IDF; the definitions presented here are just one of the variants. Here is the Wikipedia page about them: https://en.wikipedia.org/wiki/Tf–idf#Definition. As I anticipated, there isn’t a silver bullet, just continuous tuning.
As I said initially, what we covered here are the roots of information retrieval science. For instance, we didn’t cover indexing optimisations (e.g., stemming), we skipped over how real search engines mix the aforesaid models, and we didn’t mention length normalisation, the BM25 ranking function or the probabilistic model. Yeah, it’s a huge topic.
Despite that, we now have all the knowledge needed to build a working search engine. This fascinates me a lot: it’s a huge topic, but also one to experiment with and enjoy.