Analyzer in Elasticsearch: An Introduction

Written by brilianfird | Published 2020/11/23

TLDR A good search engine is one that returns relevant documents. Elasticsearch uses an Analyzer to control which documents are more relevant when querying. An Inverted Index is a data structure that stores a mapping from each token to the identifiers of the documents that contain it. An Analyzer first transforms and splits the text into tokens before saving them to the Inverted Index. The Analyzer affects how we search the text, but it doesn’t affect the content of the text itself. Elasticsearch only retrieves documents that contain the same term as the one queried.

If we want to create a good search engine with Elasticsearch, knowing how the Analyzer works is a must. A good search engine is one that returns relevant results. When a user queries something in our search engine, we need to return the documents relevant to that query.
One component we can tune so that Elasticsearch returns relevant documents is the Analyzer. The Analyzer is the component responsible for processing the text we want to index, and it is one of the components that control which documents are more relevant when querying.

A bit about Inverted Index

Since the Analyzer is tightly coupled to the Inverted Index, we need to understand what an Inverted Index is first.
An Inverted Index is a data structure that stores a mapping from each token to the identifiers of the documents that contain it. Besides document identifiers, the Inverted Index also stores the position of each token within the documents. Because Elasticsearch maps tokens to document identifiers, when you query Elasticsearch it can easily find the documents you want and return them quickly.
Indexing documents into Inverted Index
Let’s say that we want to index 2 documents:
  • Document 1: “Elasticsearch is fast”
  • Document 2: “I want to learn Elasticsearch”
Let’s take a peek into the Inverted Index and see the result of the Analysis and Indexing process.
In the Inverted Index, the terms are counted and mapped to document identifiers and to their positions in each document. The reason we don’t see the full document “Elasticsearch is fast” or “I want to learn Elasticsearch” is that they go through the Analysis process, which is the main topic of this article.
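The structure described above can be sketched in a few lines of Python. This is a simplified model rather than Elasticsearch’s actual implementation: the analyze helper is a hypothetical stand-in that only lowercases and splits on white space, and positions are counted from 0.

```python
from collections import defaultdict


def analyze(text):
    """Hypothetical stand-in for an analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each term to {document id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


documents = {
    1: "Elasticsearch is fast",
    2: "I want to learn Elasticsearch",
}
inverted_index = build_inverted_index(documents)

# Print every term with the documents and positions it appears in.
for term in sorted(inverted_index):
    print(term, inverted_index[term])
```

In this model the term “elasticsearch” points to both documents, while every other term points to exactly one.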
Querying into Inverted Index
There is one thing to note about querying the Inverted Index: Elasticsearch will only retrieve documents that contain the same term as the one queried.
We can easily test this with two types of Elasticsearch queries, the Match Query and the Term Query. Basically, the Match Query goes through an Analysis process while the Term Query doesn’t. If you’re interested in the difference between them, you can read my other article, “Elasticsearch: Text vs. Keyword”.
If you run a Term Query for “Elasticsearch” against the index in the example above, you won’t get any results. This happens because the token in the Inverted Index is “elasticsearch” with a lowercase “e”. When you try the same with a Match Query, Elasticsearch analyzes the query into “elasticsearch” before searching the Inverted Index, so the query returns results.
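The difference between the two queries can be illustrated with a toy model. The sketch below is not how Elasticsearch implements them; analyze, term_query, and match_query are hypothetical helpers that only mimic the behavior described above, with an analyzer that lowercases and splits on white space.

```python
def analyze(text):
    """Hypothetical stand-in for an analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


# Toy inverted index: term -> set of ids of the documents containing it.
documents = {1: "Elasticsearch is fast", 2: "I want to learn Elasticsearch"}
inverted_index = {}
for doc_id, text in documents.items():
    for term in analyze(text):
        inverted_index.setdefault(term, set()).add(doc_id)


def term_query(value):
    """Look the value up exactly as given, without analysis."""
    return sorted(inverted_index.get(value, set()))


def match_query(text):
    """Analyze the query text first, then look up each resulting term."""
    hits = set()
    for term in analyze(text):
        hits |= inverted_index.get(term, set())
    return sorted(hits)


print(term_query("Elasticsearch"))   # [] because the stored term is lowercase
print(match_query("Elasticsearch"))  # [1, 2]
```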

What is Analyzer in Elasticsearch?

When we insert a text document into Elasticsearch, Elasticsearch won’t save the text as it is. The text goes through an Analysis process performed by an Analyzer. In the Analysis process, the Analyzer first transforms and splits the text into tokens before saving them to the Inverted Index.
For example, inserting “Let’s build an Autocomplete!” into Elasticsearch will transform the text into 4 terms: “let’s”, “build”, “an”, and “autocomplete”.
The Analyzer affects how we search the text, but it doesn’t affect the content of the text itself. With the previous example, if we search for “build”, Elasticsearch will still return the full text “Let’s build an Autocomplete!” instead of only “build”.

Elasticsearch’s Analyze API

Elasticsearch provides a very convenient API, the Analyze API (the _analyze endpoint), that we can use to test and visualize what an analyzer does with a piece of text.
This API makes debugging our analyzer much easier. We will use it a lot in this article.
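A request to this API could be put together as in the sketch below. It assumes an unsecured node listening on http://localhost:9200, which is an assumption and not something stated in this article. The helper names build_analyze_request and send are hypothetical. The request is only built and printed here, so the snippet runs without a cluster.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="standard", base_url="http://localhost:9200"):
    """Build (but do not send) a POST request to the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_analyze_request("Let's build an Autocomplete!")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# With a node running at base_url, send(request) should return a JSON object
# whose "tokens" list describes each token the analyzer produced.
```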

Elasticsearch Analyzer Components

Elasticsearch’s Analyzer has three components you can modify depending on your use case:
  • Character Filters
  • Tokenizer
  • Token Filter
Character Filters
The first step in the Analysis process is Character Filtering, which removes, adds, or replaces characters in the text.
There are three built-in Character Filters in Elasticsearch:
  • HTML Strip Character Filter: Strips out HTML elements like <b>, <i>, <div>, <br />, et cetera, and decodes HTML entities like &amp;.
  • Mapping Character Filter: Lets you map a string of characters to another string. For example, if you want users to be able to search for an emoticon, you can map “:)” to “smile”.
  • Pattern Replace Character Filter: Replaces text that matches a regular expression with another string. Be careful though, a poorly written regular expression can slow down your document indexing process.
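To make the three filters concrete, here is a rough Python imitation of each. These functions are simplified stand-ins and not the Elasticsearch implementations: for example, the real html_strip filter handles more cases than a single regular expression can, and the function names and sample inputs are made up for illustration.

```python
import html
import re


def html_strip(text):
    """Rough stand-in for html_strip: drop tags, then decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def mapping(text, mappings):
    """Rough stand-in for the mapping filter: replace each key with its value."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def pattern_replace(text, pattern, replacement):
    """Rough stand-in for pattern_replace: regular-expression substitution."""
    return re.sub(pattern, replacement, text)


print(html_strip("<b>Tom &amp; Jerry</b>"))                      # Tom & Jerry
print(mapping("Nice work :)", {":)": "smile"}))                  # Nice work smile
print(pattern_replace("123-456-789", r"(\d+)-(?=\d)", r"\1_"))   # 123_456_789
```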
Tokenizer
After the Character Filtering process, our text proceeds to Tokenization, which is performed by a Tokenizer. Tokenization splits your text into tokens. For example, previously we transformed “Let’s build an autocomplete” into the terms “let’s”, “build”, “an”, and “autocomplete”. Splitting the text into those 4 tokens is the job of the Tokenizer.
Elasticsearch has too many Tokenizers to cover in this article. If you’re interested, you can find the full list in the Elasticsearch documentation.
Some of the most commonly used Tokenizers are:
  • Standard Tokenizer: Elasticsearch’s default Tokenizer. It splits the text on white space and punctuation.
  • Whitespace Tokenizer: A Tokenizer that splits the text on white space only.
  • Edge N-Gram Tokenizer: Really useful for creating an autocomplete. It emits prefixes of increasing length, within the range set by its min_gram and max_gram settings, and its token_chars setting controls which characters make up a word. e.g. Hello -> “H”, “He”, “Hel”, “Hell”, “Hello”.
Note that you need to be careful when choosing a Tokenizer, because one that produces too many tokens will slow down your insert process.
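The three Tokenizers above can be approximated in plain Python. This is a sketch under simplifying assumptions: the regular expression only roughly imitates the Standard Tokenizer, and the defaults chosen for min_gram and max_gram are picked to reproduce the “Hello” example rather than taken from Elasticsearch.

```python
import re


def whitespace_tokenizer(text):
    """Split on whitespace only; punctuation stays attached to the tokens."""
    return text.split()


def simple_standard_tokenizer(text):
    """Rough approximation of the standard tokenizer: runs of word characters,
    keeping an apostrophe that sits between two word characters."""
    return re.findall(r"\w+(?:['’]\w+)*", text)


def edge_ngram_tokenizer(text, min_gram=1, max_gram=5):
    """Emit prefixes of the text, from min_gram to max_gram characters long."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


print(whitespace_tokenizer("Let's build an autocomplete!"))
print(simple_standard_tokenizer("Let's build an autocomplete!"))
print(edge_ngram_tokenizer("Hello"))
```

The only difference between the first two outputs is the last token: the whitespace version keeps “autocomplete!” while the standard-like version produces “autocomplete”.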
Token Filter
Token Filtering is the third and final step in Analysis. This step transforms the tokens depending on the Token Filters we use. In the Token Filtering process, we can lowercase the tokens, remove stop words, and add synonyms.
Elasticsearch also has many Token Filters, which you can read about in its documentation.
The most commonly used Token Filter is the lowercase token filter, which lowercases all your tokens.
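The three operations mentioned above (lowercasing, removing stop words, adding synonyms) can be pictured as functions that take a list of tokens and return a new list. The sketch below is illustrative only: the stop-word set and the synonym “typeahead” are made-up examples, and Elasticsearch places synonyms at the same position as the original token instead of appending them to a list.

```python
def lowercase_filter(tokens):
    """Lowercase every token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stop_words):
    """Drop tokens that appear in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


def synonym_filter(tokens, synonyms):
    """Keep each token and add its synonyms right after it."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(synonyms.get(token, []))
    return expanded


# Token filters are applied in order, each one feeding the next.
tokens = ["Build", "An", "Autocomplete"]
tokens = lowercase_filter(tokens)
tokens = stop_filter(tokens, stop_words={"an", "the", "is"})
tokens = synonym_filter(tokens, synonyms={"autocomplete": ["typeahead"]})
print(tokens)  # ['build', 'autocomplete', 'typeahead']
```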

Standard Analyzer

The Standard Analyzer is the default analyzer in Elasticsearch. If you don’t specify an analyzer in the mapping, your field will use this analyzer. It uses the grammar-based tokenization specified in https://unicode.org/reports/tr29/, and it works pretty well for most languages.
The standard analyzer uses:
  • Standard Tokenizer
  • Lower Case Token Filter
  • Stop Token Filter (disabled by default)
So with those components, it basically does:
  • Tokenize the text into tokens by white space and punctuation
  • Lowercase the tokens
  • If you enable Stop Token Filter, it will remove stop words
Let’s try analyzing the text “Let’s learn about Analyzer!” with the Standard Analyzer.
The Standard Analyzer splits the text into tokens on white space. It also drops the trailing “!”, because punctuation that does not sit between letters is treated as a word boundary, while the apostrophe inside “let’s” is kept. All the tokens are lowercased as well, because the Standard Analyzer uses the Lower Case Token Filter.
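The result described above can be approximated offline. The sketch below imitates the Standard Analyzer with a regular expression, which is a simplification of the Unicode segmentation rules and not the real tokenizer. It returns dictionaries with a subset of the fields found in the Analyze API response (token, start_offset, end_offset, position).

```python
import re

# Rough approximation of the standard tokenizer (not the full UAX #29 rules):
# runs of word characters, keeping an apostrophe between two word characters.
WORD = re.compile(r"\w+(?:['’]\w+)*")


def standard_analyze(text):
    """Return token dicts shaped like the "tokens" list of the Analyze API."""
    return [
        {
            "token": match.group().lower(),  # lowercase token filter
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        }
        for position, match in enumerate(WORD.finditer(text))
    ]


for token in standard_analyze("Let’s learn about Analyzer!"):
    print(token)
```

The four printed tokens are “let’s”, “learn”, “about”, and “analyzer”. The “!” at offset 26 is not part of any token.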
In my previous article, “Create a Simple Autocomplete With Elasticsearch”, we only used the Standard Analyzer and still managed to create a simple autocomplete. By using a custom analyzer that contains our chosen Character Filters, Tokenizer, and Token Filters, we can make a more advanced autocomplete that produces more relevant results.

Custom Analyzer

A Custom Analyzer is an analyzer whose name and components we define ourselves.
To create a custom analyzer, we have to define it in our Elasticsearch settings, which we can do when creating an index:
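A request body along these lines would define such an analyzer. The analyzer name cust_analyzer and its three components are the ones used in the rest of this section. The index name in the comment is a hypothetical placeholder, and the body is written as a Python dictionary only so that it can be printed as JSON.

```python
import json

# Request body for creating an index (for example PUT /my-index, where the
# index name is hypothetical) with a custom analyzer named cust_analyzer.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "cust_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

print(json.dumps(index_body, indent=2))
```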
We just created a custom analyzer named cust_analyzer with the Whitespace Tokenizer, the html_strip character filter, and the lowercase token filter. But is there any way to test the analyzer before we use it on a field? Yes: the Analyze API can run it against sample text.
Let’s try it out using the text <b>Let’s build an autocomplete!</b> and compare it with the standard analyzer.
We can see some differences between them:
  1. The result of the standard analyzer has two b tokens while the result of cust_analyzer has none. This happens because cust_analyzer strips away the HTML tags completely.
  2. The standard analyzer splits the text on white space and on special characters like <, >, and !, while cust_analyzer only splits the text on white space.
  3. The standard analyzer strips away special characters while cust_analyzer does not. We can see this in the last token: autocomplete versus autocomplete!.
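The three differences listed above can be reproduced with two small approximations of the analyzers. Both functions are simplified imitations written for this comparison, not the Elasticsearch implementations.

```python
import re


def standard_analyzer(text):
    """Rough approximation: word-like runs (apostrophes inside words kept), lowercased."""
    return [token.lower() for token in re.findall(r"\w+(?:['’]\w+)*", text)]


def cust_analyzer(text):
    """Rough approximation: strip HTML tags, split on whitespace, lowercase."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return [token.lower() for token in without_tags.split()]


text = "<b>Let’s build an autocomplete!</b>"
print(standard_analyzer(text))  # ['b', 'let’s', 'build', 'an', 'autocomplete', 'b']
print(cust_analyzer(text))      # ['let’s', 'build', 'an', 'autocomplete!']
```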
So everything is as expected, and our analyzer behaves the way we want. Our last step is to apply it to a field by using a mapping.
Now we have mapped the standard-text field to use the standard analyzer and the cust_analyzer-text field to use cust_analyzer.
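A mapping for these two fields could look like the sketch below. The field names and analyzer names come from this section. The field type text and the index name in the comment are assumptions made for illustration.

```python
import json

# Request body for a mapping update (for example PUT /my-index/_mapping, where
# the index name is hypothetical). Both fields are assumed to be of type text.
mapping_body = {
    "properties": {
        "standard-text": {"type": "text", "analyzer": "standard"},
        "cust_analyzer-text": {"type": "text", "analyzer": "cust_analyzer"},
    }
}

print(json.dumps(mapping_body, indent=2))
```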
Let’s index the document <b>Let’s build an autocomplete!</b> with each analyzer and try it out!
Let’s query for b with a bool query and see what happens.
The only result is the document we indexed using the standard analyzer. The standard analyzer doesn’t strip the HTML tags, so b is indexed as a token, while cust_analyzer removes the tags with the html_strip character filter.
Let’s try one more query, autocomplete!.
Now the only result is the document we indexed using cust_analyzer. Since the standard analyzer splits the document into tokens on white space and punctuation, it removes the ! character. cust_analyzer only splits the document on white space, so the ! is not removed.
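Both outcomes can be replayed with a toy model. The sketch below makes several assumptions that are not confirmed by this article: one document is indexed per field, the searched value is looked up exactly as typed (as a Term Query would do), and a document matches if any of its fields contains the term. The document ids and the analyzer functions are hypothetical approximations.

```python
import re


def standard_analyzer(text):
    """Rough approximation of the standard analyzer."""
    return [t.lower() for t in re.findall(r"\w+(?:['’]\w+)*", text)]


def cust_analyzer(text):
    """Rough approximation of cust_analyzer: strip tags, split on whitespace, lowercase."""
    return [t.lower() for t in re.sub(r"<[^>]+>", "", text).split()]


ANALYZERS = {"standard-text": standard_analyzer, "cust_analyzer-text": cust_analyzer}

# One document per field, each holding the same text.
documents = {
    "doc-standard": {"standard-text": "<b>Let’s build an autocomplete!</b>"},
    "doc-custom": {"cust_analyzer-text": "<b>Let’s build an autocomplete!</b>"},
}

# Index time: every field value goes through that field's analyzer.
indexed_terms = {
    doc_id: {field: set(ANALYZERS[field](value)) for field, value in fields.items()}
    for doc_id, fields in documents.items()
}


def search_any_field(term):
    """Return ids of documents where any field contains the exact term."""
    return sorted(
        doc_id
        for doc_id, fields in indexed_terms.items()
        if any(term in terms for terms in fields.values())
    )


print(search_any_field("b"))              # ['doc-standard']
print(search_any_field("autocomplete!"))  # ['doc-custom']
```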

Conclusion

The Analyzer is an important component to learn if you want to create a good search engine. Understanding it is the first step toward controlling which documents are shown to users when they search.
The next step toward creating a good search engine is to understand relevance score calculation, boosting, and which queries and features to use, which I intend to write about too. So, wait for it 🙂
Lastly, I want to say thank you to everyone who read until the end!
Also published here.

Written by brilianfird | A Software Engineer based in Indonesia. Blog: https://codecurated.com
Published by HackerNoon on 2020/11/23