What is Autocomplete?
Let’s take a very common example. Whenever you go to Google and start typing, a drop-down appears listing suggestions. Those suggestions are related to the query and help the user complete their query.
Suggestions when typing on Google
Autocomplete, as Wikipedia says:
Autocomplete, or word completion, is a feature in which an application predicts the rest of a word a user is typing.
It is also known as Search as you type or Type Ahead Search. It guides users by prompting them with likely completions and alternatives to the text as they type. It reduces the number of characters a user needs to type before executing a search, thereby enhancing the search experience.
Autocomplete can be implemented by using any database. In this post, we will use Elasticsearch to build autocomplete functionality.
Elasticsearch is an open-source, distributed, JSON-based search engine built on top of Lucene.
There are various approaches to building autocomplete functionality in Elasticsearch. We will discuss the following approaches: Prefix Query, Edge Ngram, and Completion Suggester.
The Prefix Query approach involves using a prefix query against a custom field. The value for this field can be stored as a keyword so that multiple terms (words) are stored together as a single term. This can be accomplished by using the keyword tokeniser. This approach has some disadvantages: as the examples below show, it matches only at the beginning of the text.
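To make the idea concrete, here is a minimal Python sketch of what a prefix match against a single keyword term amounts to. It does not talk to Elasticsearch; `keyword_term`, `prefix_search` and `prefix_query_body` are hypothetical helper names, the lowercasing step is an assumption about the analyser, and the three titles are just a small sample. The request body uses the name.keywordstring field described later in this post.

```python
def keyword_term(name: str) -> str:
    """Mimic a keyword tokeniser plus a lowercase filter (assumed):
    the whole value becomes one single term."""
    return name.lower()


def prefix_search(names: list[str], prefix: str) -> list[str]:
    """Return the names whose single keyword term starts with the prefix."""
    needle = prefix.lower()
    return [n for n in names if keyword_term(n).startswith(needle)]


def prefix_query_body(prefix: str) -> dict:
    """Build a prefix query request body for the name.keywordstring field."""
    return {"query": {"prefix": {"name.keywordstring": prefix.lower()}}}


movies = [
    "Captain America: Civil War",
    "Captain America: The Winter Soldier",
    "Guardians of the Galaxy",
]

print(prefix_search(movies, "gu"))   # ['Guardians of the Galaxy']
print(prefix_search(movies, "am"))   # [] because no title starts with "am"
print(prefix_query_body("Gu"))
```

Because the whole title is one term, a prefix can only ever match from the first character of the title.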
The Edge Ngram approach involves using different analysers at index and search time. When indexing the document, a custom analyser with an edge n-gram filter can be applied. At search time, a standard analyser can be applied, which prevents the query from being split into n-grams.
The Edge N-gram tokeniser first breaks the text down into words on custom characters (spaces, special characters, etc.) and then keeps only the n-grams that start at the beginning of each word.
This approach works well for matching a query in the middle of the text as well. It is generally fast for queries but may result in slower indexing and larger index storage.
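The following Python sketch imitates the index-time and search-time behaviour described above, without using Elasticsearch itself. The helper names (`edge_ngrams`, `matches`), the gram sizes, and the choice to split on anything that is not a letter or digit are assumptions made for illustration.

```python
import re


def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Split the text into lowercase words, then emit the leading
    n-grams of every word (index-time behaviour)."""
    grams = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        longest = min(max_gram, len(word))
        for size in range(min_gram, longest + 1):
            grams.append(word[:size])
    return grams


def matches(title: str, query: str) -> bool:
    """Search-time behaviour: the query is split into whole words only
    (no n-grams) and each word is looked up among the indexed grams."""
    indexed = set(edge_ngrams(title))
    query_terms = re.findall(r"[a-z0-9]+", query.lower())
    return any(term in indexed for term in query_terms)


print(edge_ngrams("Civil War", max_gram=3))
# ['c', 'ci', 'civ', 'w', 'wa', 'war']
print(matches("Captain America: Civil War", "am"))  # True
```

Since every word contributes its own prefixes, am is found even though America sits in the middle of the title. The cost is visible too: a two-word title already produces many more terms than it has words.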
Elasticsearch ships with an in-house solution called Completion Suggester. It uses an in-memory data structure called a Finite State Transducer (FST). Elasticsearch stores the FST on a per-segment basis, which means suggestions scale horizontally as more nodes are added.
Some things to keep in mind when implementing Completion Suggester:
The field that holds the suggestions must use completion as its field type.
This is the ideal approach for implementing autocomplete functionality; however, it also has certain disadvantages:
Matching always starts at the beginning of the text, so searching for america in the Marvel movie dataset will not yield any result. One way to overcome this is to tokenize the input text on spaces and keep all the resulting phrases as canonical names. This way, Captain America: Civil War will be stored as a set of phrases rather than a single name.
Highlighting of the matched words is not supported.
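As a rough illustration of the workaround for the first disadvantage, the sketch below shows one possible reading of "tokenizing on space and keeping all the phrases": every phrase that starts at a word boundary and runs to the end of the title becomes an input. The helper name `suffix_phrases` is hypothetical, and the exact list of phrases used originally may differ.

```python
def suffix_phrases(title: str) -> list[str]:
    """Split the title on whitespace and return every phrase that
    starts at a word and runs to the end of the title."""
    words = title.split()
    return [" ".join(words[start:]) for start in range(len(words))]


print(suffix_phrases("Captain America: Civil War"))
# ['Captain America: Civil War', 'America: Civil War', 'Civil War', 'War']
```

With these phrases stored as inputs, a prefix such as america has something to match from its first character, at the price of several entries per movie.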
Let’s implement the above approaches in Elasticsearch. We will be using Marvel movie data to build our sample index. For easy reference, here is the
We will be creating an index
movies with type
If we look at the mapping, we will observe that name is a nested field which contains several fields, each analysed in a different way.
name.keywordstring is analysed using a Keyword tokenizer, hence it will be used for the Prefix Query approach.
name.edgengram is analysed using an Edge Ngram tokenizer, hence it will be used for the Edge Ngram approach.
name.completion is stored as a completion type, hence it will be used for Completion Suggester.
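For readers who want to see how three such fields could be declared, here is a hedged sketch of an index body written as a Python dictionary. It is not the mapping used in this post: it assumes a recent Elasticsearch version without mapping types, models the three fields as multi-fields of name, and the analyser names, gram sizes and lowercase filters are all assumptions.

```python
import json

# Hypothetical index body: settings define the two custom analysers,
# mappings attach one sub-field per autocomplete approach.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_ngram_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                # Whole value as a single lowercase term (prefix query).
                "keyword_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"],
                },
                # Leading n-grams of every word (edge n-gram approach).
                "edge_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "edge_ngram_tokenizer",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "fields": {
                    "keywordstring": {
                        "type": "text",
                        "analyzer": "keyword_analyzer",
                    },
                    "edgengram": {
                        "type": "text",
                        "analyzer": "edge_ngram_analyzer",
                        # The query itself is not split into n-grams.
                        "search_analyzer": "standard",
                    },
                    "completion": {"type": "completion"},
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Keeping all three variants under one source field means each movie name is sent once and indexed three different ways.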
We will index all our movies by using
Let’s start with the Prefix Query approach and try finding a movie beginning with
Query will be
This will result in the following movie
The result is fair, but some movies, such as Captain America: The Winter Soldier and Guardians of the Galaxy, are missed because a prefix query only matches at the beginning of the text and not in the middle.
Let’s try finding another movie beginning with am.
Here we do not get any results, although Captain America satisfies this condition. This confirms that a prefix query cannot be used to match in the middle of the text.
Let’s run the same search, am, but with the Edge Ngram approach.
Here we get the following result
Let’s try searching for Captain America again, but this time with a bigger phrase: captain america the
Using Edge N-gram approach, we get the following movies
If we observe our phrase, only the first two suggestions make sense. The reason so many terms get matched is the way the match clause works: match includes all the documents which contain captain OR america OR the. Since the field is analysed using n-grams, more suggestions (if present) will get included as well.
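The OR behaviour can be imitated in a few lines of Python. This is a simulation, not a call to Elasticsearch: `words`, `edge_ngram_set` and `match_any` are hypothetical helpers, and the three titles are only a sample. The request body at the end shows what a match query against name.edgengram could look like.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def edge_ngram_set(title: str) -> set[str]:
    """All leading n-grams of every word in the title."""
    return {w[:size] for w in words(title) for size in range(1, len(w) + 1)}


def match_any(title: str, query: str) -> bool:
    """OR semantics: one matching query word is enough."""
    grams = edge_ngram_set(title)
    return any(term in grams for term in words(query))


titles = [
    "Captain America: The Winter Soldier",
    "Captain America: Civil War",
    "Guardians of the Galaxy",
]
query = "captain america the"

print([t for t in titles if match_any(t, query)])
# All three titles match: the last one only because it contains "the".

match_body = {"query": {"match": {"name.edgengram": query}}}
print(match_body)
```

Guardians of the Galaxy shares nothing with the phrase except the word the, yet it still qualifies, which is exactly the noise described above.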
Let’s try using the suggestion query for the same phrase, captain america the. The suggestion query is written in a slightly different way.
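As a sketch of that different shape, the Python snippet below builds a completion suggester request body. The suggestion name movie-suggest and the field name.completion come from this post; the helper name `suggest_body` is hypothetical, and the body is only printed, not sent to a cluster.

```python
import json


def suggest_body(prefix: str) -> dict:
    """Build a completion suggester request: the text goes under
    "prefix" and the target field under "completion"."""
    return {
        "suggest": {
            "movie-suggest": {
                "prefix": prefix,
                "completion": {"field": "name.completion"},
            }
        }
    }


print(json.dumps(suggest_body("captain america the"), indent=2))
```

Unlike the earlier queries, the request lives under a suggest key, and the results come back under the chosen suggestion name rather than in the regular hits.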
We get the following movies as result
Let’s try the same query, but this time with a typo: captain amrica the.
movie-suggest returns no results because the query has no support for fuzziness. We can update the query to include support for fuzziness in the following way
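A hedged sketch of such a request, built in Python, is shown below. The fuzzy option with a fuzziness value is part of the completion suggester; the value 1 and the helper names `fuzzy_suggest_body` and `edit_distance` are assumptions. The small edit-distance function is plain Levenshtein, included only to show why a fuzziness of 1 is enough for this typo; Elasticsearch's own distance calculation is not reproduced here.

```python
def fuzzy_suggest_body(prefix: str, fuzziness: int = 1) -> dict:
    """Completion suggester request that tolerates small typos."""
    return {
        "suggest": {
            "movie-suggest": {
                "prefix": prefix,
                "completion": {
                    "field": "name.completion",
                    "fuzzy": {"fuzziness": fuzziness},
                },
            }
        }
    }


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


print(edit_distance("amrica", "america"))  # 1
print(fuzzy_suggest_body("captain amrica the"))
```

The typo is a single missing letter, so one edit is enough to bring captain amrica the back in line with the indexed names.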
The above query returns the following results
Various approaches can be used to implement autocomplete functionality in Elasticsearch. Completion Suggester covers most of the cases required to implement a fully functional and fast autocomplete.