Spell correction, also known as autocorrect, spell checking and so on, is a software functionality that suggests alternatives to, or automatically corrects, the text you have typed in.
The concept of correcting typed text dates back to the 1960s, says Brad Myers, a professor of interface design at Carnegie Mellon University. That’s when a computer scientist named Warren Teitelman, who also invented the “undo” command, came up with a philosophy of computing called D.W.I.M., or “Do What I Mean.” Rather than programming computers to accept only perfectly formatted instructions, Teitelman argued we should program them to recognize obvious mistakes.
The first well-known product to provide spell correction was Microsoft Word 6.0, released in 1993.
Nowadays people can’t imagine life without messaging, searching, browsing and using various business applications: text editors, content management systems (CMS), customer relationship management systems (CRM) and others. And in all of them people make typing mistakes, forget the proper spelling of words, get grammar wrong and so on.
This is absolutely fine; everybody makes mistakes. What good software does is help you fix those mistakes in the easiest and most efficient way. There are many approaches to doing that, differing both in how the feature looks to the user and what value it brings, and in how it’s made under the hood.
In this article we will focus specifically on search systems, because for them it’s especially important to correct your search query before (or while) it’s processed, since otherwise your search results may be completely wrong.
In terms of user experience we can split the approaches into two groups: those that correct you as you type and those that suggest a correction after the query has been run. In most cases these approaches are combined: a system tries to correct you as you type, and if it thinks there’s still an issue, it suggests a correction after you get the search results.
Let me demonstrate with screenshots how different popular sites use the aforementioned techniques.
Google, as the most popular site on the Internet, does everything: autocorrect, autocomplete, suggestions, mistake highlighting and “did you mean”. Google uses a prediction service to help complete searches and even URLs typed in the address bar. These suggestions, or predictions as Google calls them, are based on related web searches, your browsing history, and Google’s ranking of websites.
Amazon uses an e-commerce-oriented search algorithm called A9. What differentiates it from Google’s algorithm is that it puts a strong emphasis on sales conversions.
Amazon’s autocorrect implementation includes all the previously mentioned components, but since Amazon is a marketplace, its suggestion ranking formula seems to also account for how attractive a suggestion may be to you in terms of sales.
Pinterest also has its own algorithm based on internal domain/pins/pinners relevance.
Let’s now think about how spell correction can be done. There are a few ways, but the important thing is that there is no purely programmatic way to convert your mistyped “ipone” into “iphone” (at least not with decent quality). In most cases there has to be a data set the system is based on. The data set can be:
But it might still make sense for smaller data collections, as then there’s a higher chance the search query will be corrected at all.
But if I change it to “dark ber” it understands I mean beer:
The context may be not just a neighboring word in your query, but also your location, the time of day, the current sentence’s grammar (to decide, say, whether to change “there” to “their”), your search history and virtually anything else that can affect your intent.
All these approaches can use different algorithms for searching and evaluating candidate spell corrections:
Looking at how fast AI, machine learning and semantic search have been evolving lately, I believe the landscape of spell correction techniques may change dramatically this decade, but for now the above approaches seem to be the most widely used. And most solutions are based on your existing data collection and the dictionary built from it.
Here I will describe part of my own experience building a website search with the help of Manticore Search. I discovered this product about a year ago while working on some client projects, and I’m still using it in a few of my own projects thanks to its lightweight, fast search and many out-of-the-box features.
Let’s now talk about what tools you can use to add autocorrect functionality to your system, and at the same time go a little deeper into how at least the simpler techniques work.
To do that we will take one of the popular custom search engines, as they often provide the functionality we’re looking for. I like Manticore Search because, in contrast to the better-known Elasticsearch, it is much more lightweight, easier to deal with since it’s SQL-native, and faster in most cases. It’s so SQL-native that you can even connect to it with a standard MySQL client. They provide an official Docker image, so I’ll use that to demonstrate how autocorrect works in Manticore Search:
➜ ~ docker run --name manticore -d manticoresearch/manticore
1ac28e94b3ca728fb431ba79255f81eb31ec264b5d9d49f1d2db973e08a32b9f
And inside the container I’ll run mysql to connect to the Manticore Search server running there:
➜ ~ docker exec -it manticore mysql
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.5.21
Copyright © 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
It ships with 2 test tables that we can use for a quick play with spell correction:
MySQL [(none)]> show tables;
+--------+-----------+
| Index | Type |
+--------+-----------+
| pq | percolate |
| testrt | rt |
+--------+-----------+
2 rows in set (0.00 sec)
Let’s check the schema of testrt:
MySQL [(none)]> describe testrt;
+---------+--------+------------+
| Field | Type | Properties |
+---------+--------+------------+
| id | bigint | |
| title | field | stored |
| content | field | stored |
| gid | uint | |
+---------+--------+------------+
4 rows in set (0.00 sec)
and insert a few documents into it. Let them be movie titles:
MySQL [(none)]> insert into testrt(title) values('The Godfather'),('The Wizard of Oz'),('Combat!'),('The Shawshank Redemption'),('Pulp Fiction'),('Casablanca'),('The Godfather: Part II'),('Mortal Kombat'),('2001: A Space Odyssey'),('Schindler\'s List');
Query OK, 10 rows affected (0.00 sec)
We can now see all our documents in the index:
MySQL [(none)]> select * from testrt;
+---------------------+------+--------------------------+---------+
| id | gid | title | content |
+---------------------+------+--------------------------+---------+
| 1657719407478046741 | 0 | The Godfather | |
| 1657719407478046742 | 0 | The Wizard of Oz | |
| 1657719407478046743 | 0 | Combat! | |
| 1657719407478046744 | 0 | The Shawshank Redemption | |
| 1657719407478046745 | 0 | Pulp Fiction | |
| 1657719407478046746 | 0 | Casablanca | |
| 1657719407478046747 | 0 | The Godfather: Part II | |
| 1657719407478046748 | 0 | Mortal Kombat | |
| 1657719407478046749 | 0 | 2001: A Space Odyssey | |
| 1657719407478046750 | 0 | Schindler's List | |
+---------------------+------+--------------------------+---------+
10 rows in set (0.00 sec)
What has just happened: when we inserted the titles, Manticore Search didn’t just save them as is. It also did a lot of things essential for natural language processing (NLP): tokenization (splitting the titles into words), normalization (lowercasing everything) and much more. One step is especially important for us: it split the words into the trigrams they consist of. For example, “Combat!” (or actually “combat” after normalization) was split into “com”, “omb”, “mba”, “bat”. So why is that important?
To answer that, let’s assume we make a typo in “combat” and type “comabt” instead. If we now want “combat” to be found, how would we do that? The options are:
So what Manticore Search does internally in this case is first split “comabt” into its trigrams (infixes): “com”, “oma”, “mab” and “abt”, and then find the corresponding words in the dictionary, since it did the same split when the document was inserted. The complexity of this algorithm is not very high (much lower than comparing each word against every other word), so Manticore quickly finds that there’s a word “combat” which also contains “com”, and that it’s the only candidate word. All it has to do after that is make sure the candidate doesn’t differ too much from the given word: it calculates the Levenshtein distance, sees it’s not greater than the limit (4 by default) and returns the candidate:
MySQL [(none)]> call qsuggest('comabt', 'testrt');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| combat | 2 | 1 |
+---------+----------+------+
1 row in set (0.00 sec)
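The candidate lookup just described can be sketched in Python. This is a simplified illustration of the idea, not Manticore’s actual implementation; the dictionary here is just a few words from our test index:

```python
from collections import defaultdict

def trigrams(word):
    """Overlapping character trigrams of a normalized word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

# Inverted index: trigram -> set of dictionary words, built at insert time.
dictionary = ["mortal", "kombat", "combat", "godfather", "casablanca"]
index = defaultdict(set)
for word in dictionary:
    for t in trigrams(word):
        index[t].add(word)

def candidates(query):
    """Dictionary words sharing at least one trigram with the query."""
    found = set()
    for t in trigrams(query):
        found |= index.get(t, set())
    return found

print(trigrams("comabt"))    # ['com', 'oma', 'mab', 'abt']
print(candidates("comabt"))  # {'combat'}: only "combat" shares a trigram ("com")
```

Note how “kombat” is not even considered here: it shares no trigram with “comabt”, so it never makes it to the distance check.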
In other cases there may be more candidates. For example, if you search for “combat”, there is one word in the dictionary which exactly matches your query (the Levenshtein distance is 0), but there’s another one you might mean, and CALL QSUGGEST returns it too:
MySQL [(none)]> call qsuggest('combat', 'testrt');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| combat | 0 | 1 |
| kombat | 1 | 1 |
+---------+----------+------+
2 rows in set (0.00 sec)
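The “distance” column above is plain Levenshtein (edit) distance between the typed word and the candidate. A compact dynamic-programming sketch, just to show there’s no magic involved:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = cur
    return prev[-1]

print(levenshtein("combat", "combat"))  # 0: exact match
print(levenshtein("combat", "kombat"))  # 1: c -> k
print(levenshtein("comabt", "combat"))  # 2: the swapped "ab" costs two substitutions
```

These are exactly the distances qsuggest reported above: 0 for “combat”, 1 for “kombat”, and 2 for the “comabt” typo.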
If you enter two words, CALL QSUGGEST only processes the last one:
MySQL [(none)]> call qsuggest('mortal combat', 'testrt');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| combat | 0 | 1 |
| kombat | 1 | 1 |
+---------+----------+------+
2 rows in set (0.00 sec)
“So how do I use this functionality in my application?” you may ask. It’s quite simple. Just:
Here’s the log of what it would give for our dataset; note the input (after “call qsuggest”) and the output (in the “suggest” column):
MySQL [(none)]> call qsuggest('m', 'testrt');
Empty set (0.00 sec)
MySQL [(none)]> call qsuggest('mo', 'testrt');
Empty set (0.00 sec)
MySQL [(none)]> call qsuggest('mor', 'testrt');
Empty set (0.00 sec)
MySQL [(none)]> call qsuggest('mort', 'testrt');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| mortal | 2 | 1 |
+---------+----------+------+
1 row in set (0.01 sec)
MySQL [(none)]> call qsuggest('morta', 'testrt');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| mortal | 1 | 1 |
+---------+----------+------+
1 row in set (0.00 sec)
MySQL [(none)]> call qsuggest('mortal', 'testrt');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| mortal | 0 | 1 |
+---------+----------+------+
1 row in set (0.00 sec)
MySQL [(none)]> call qsuggest('mortal c', 'testrt');
Empty set (0.01 sec)
MySQL [(none)]> call qsuggest('mortal co', 'testrt');
Empty set (0.00 sec)
MySQL [(none)]> call qsuggest('mortal com', 'testrt');
Empty set (0.00 sec)
MySQL [(none)]> call qsuggest('mortal comb', 'testrt');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| combat | 2 | 1 |
| kombat | 3 | 1 |
+---------+----------+------+
2 rows in set (0.00 sec)
MySQL [(none)]> call qsuggest('mortal comba', 'testrt');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| combat | 1 | 1 |
| kombat | 2 | 1 |
+---------+----------+------+
2 rows in set (0.00 sec)
MySQL [(none)]> call qsuggest('mortal combat', 'testrt');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| combat | 0 | 1 |
| kombat | 1 | 1 |
+---------+----------+------+
2 rows in set (0.00 sec)
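Putting it together, here’s a sketch of how an application could wire CALL QSUGGEST into an as-you-type search box. The pymysql client and the helper names are my assumptions for illustration, not part of Manticore’s API; the pure helper at the top is what you’d feed the rows the server returns:

```python
def apply_suggestion(query, rows):
    """Replace the last word of the query with the top suggestion.

    rows are (suggest, distance, docs) tuples as returned by
    CALL QSUGGEST, ordered by distance; an empty result means
    there's nothing to correct yet, so the query is kept as is.
    """
    if not rows:
        return query
    words = query.split()
    words[-1] = rows[0][0]
    return " ".join(words)

# The network part (requires a running Manticore instance listening on
# the MySQL protocol, port 9306 by default, and pymysql installed):
#
# import pymysql
# conn = pymysql.connect(host="127.0.0.1", port=9306)
# with conn.cursor() as cur:
#     cur.execute("CALL QSUGGEST('mortal comabt', 'testrt')")
#     rows = cur.fetchall()
# print(apply_suggestion("mortal comabt", rows))

print(apply_suggestion("mortal comabt", [("combat", 2, 1), ("kombat", 3, 1)]))
# mortal combat
```

On the UI side you would call this on every keystroke (debounced), then either show the corrected query as a “did you mean” hint or apply it silently, which is exactly the combination of techniques we saw on Google and Amazon earlier.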