Manticore Search: Wordforms vs Exceptions

Exceptions and wordforms are two useful tools built into Manticore Search that you can use to improve search recall and precision. They have a lot in common, but there are also important differences, which I’d like to cover in this article.

About tokenization

What’s the difference between full-text search (also called free-text search) and wildcard kinds of search such as:
- the commonly known LIKE operator in this or that form
- or more complex regular expressions

Of course, there are tons of differences, but it all starts with what we do with the initial input text in each of the approaches:
- with the wildcard search approach we normally consider the text as a whole
- while in the area of full-text search it’s essential to first tokenize the text and then consider each token as a separate entity

When you want to tokenize text you need to decide how to do it, in particular:
- What the separators and word characters should be. Normally a separator is a character that doesn’t occur inside a word, for example punctuation marks: . , ? ! - etc.
- Whether the tokens’ letter case should be retained or not. Normally it’s not, since it’s bad for search if you can’t find Orange by the keyword orange.

Manticore does it all automatically. For example, the text “What do I have? The list is: a cat, a dog and a parrot.” gets tokenized into:

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('What do I have? The list is: a cat, a dog and a parrot.', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | what      | what       |
| 2    | do        | do         |
| 3    | i         | i          |
| 4    | have      | have       |
| 5    | the       | the        |
| 6    | list      | list       |
| 7    | is        | is         |
| 8    | a         | a          |
| 9    | cat       | cat        |
| 10   | a         | a          |
| 11   | dog       | dog        |
| 12   | and       | and        |
| 13   | a         | a          |
| 14   | parrot    | parrot     |
+------+-----------+------------+
14 rows in set (0.00 sec)

As you can see:
- the punctuation marks were removed
- and all the words were lowercased

Problem

Here comes the first problem: in some cases characters that are normally separators should be treated as regular word characters. For example, in “Is c++ the most powerful language?” it’s obvious that c++ is a separate word. That’s easy for people to understand, but not for a full-text search algorithm: it sees the plus sign, doesn’t find it in its list of word characters and removes it from the token, so you end up with:

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('Is c++ the most powerful language?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c         | c          |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
+------+-----------+------------+
6 rows in set (0.00 sec)

OK, but what’s the problem? The problem is that after this tokenization, if you search for c#, for example, you will find the above sentence:

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> insert into t values(0,'Is c++ the most powerful language?');
mysql> select highlight() from t where match('c#');
+-------------------------------------------+
| highlight()                               |
+-------------------------------------------+
| Is <b>c</b>++ the most powerful language? |
+-------------------------------------------+
1 row in set (0.01 sec)

It happens because c# is also tokenized to just c, and then the c from the search query matches the c from the document and you get it.

What’s the solution?

There are a few options. The first one that probably comes to mind is: OK, why don’t I add + and # to the word characters list? It’s a good and fair question. Let’s try.

mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('Is c++ the most powerful language?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
+------+-----------+------------+
6 rows in set (0.00 sec)

It works, but + in the list immediately starts affecting other words and searches, for example:

mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('What is 2+2?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | what      | what       |
| 2    | is        | is         |
| 3    | 2+2       | 2+2        |
+------+-----------+------------+
3 rows in set (0.00 sec)

You wanted c++ to be a separate word, but not 2+2, didn’t you? Right, so what can we do? To treat c++ in a special way you can make it an exception.

Exceptions

So, exceptions (also known as synonyms) allow you to map one or more terms (including terms with characters that would normally be excluded) to a single keyword. Let’s make c++ an exception by putting it into an exceptions file:

➜ ~ cat /tmp/exceptions
c++ => c++

and using the file when we create the table:

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is c++ the most powerful language? What is 2+2?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | what      | what       |
| 8    | is        | is         |
| 9    | 2         | 2          |
| 10   | 2         | 2          |
+------+-----------+------------+
10 rows in set (0.01 sec)

Hooray, c++ is now a separate word and the plus signs are not lost, and all is OK with 2+2 too.

What you need to remember about exceptions is that they are very dumb, not smart at all: they do exactly what you ask them to do and nothing more. In particular:
- they don’t change the case
- if you make a mistake and put a double space, they don’t convert it into a single space
- and so on.

They literally treat your input as an array of bytes.

For example, people write c++ both in lower and upper case. Let’s try the above exception with the upper case:

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c         | c          |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | how       | how        |
| 8    | about     | about      |
| 9    | c++       | c++        |
+------+-----------+------------+
9 rows in set (0.00 sec)

Oops, C++ was tokenized as just c, because the exception is c++ (lower case), not C++ (upper case).

But did you notice that the exception consists of a pair of items, not a single one: c++ => c++? The left part is what triggers the exceptions algorithm in the text, the right part is the resulting token. Let’s try adding a mapping of C++ to c++:

➜ ~ cat /tmp/exceptions
c++ => c++
C++ => c++

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | how       | how        |
| 8    | about     | about      |
| 9    | c++       | c++        |
+------+-----------+------------+
9 rows in set (0.00 sec)

Alright, now it’s fine again, since both C++ and c++ are tokenized into the token c++. So satisfying.

What are other good examples of exceptions?
- AT&T => AT&T and at&t => AT&T
- M&M's => M&M's, m&m's => M&M's and M&m's => M&M's
- U.S.A. => USA and US => USA

What are the bad examples?
- us => USA, because we don’t want each us to become USA.

So the rule of thumb with exceptions is: if a term includes special characters and that’s how it’s normally written in text and in a search query, make it an exception.

Synonyms

Users of Manticore Search also often call exceptions synonyms, because another use case for them is not just to retain special characters and letter case, but to map terms written absolutely differently to the same token, for example:

MS Windows => ms windows
Microsoft Windows => ms windows

Why is this important? Because it makes it easy to find documents containing Microsoft Windows by searching for MS Windows and vice versa.

Example:

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems');
mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976139 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

So at first glance it works fine, but thinking further about it and recalling that exceptions are case and byte sensitive, you may ask yourself: “Can’t people write MicroSoft windows, MS WINDOWS, microsoft Windows and so on?” Yes, they can. So if you want to use exceptions for that, be ready for what’s called in mathematics a combinatorial explosion. It doesn’t look good at all. What can we do about it?

Wordforms

Another tool which is similar to exceptions is wordforms. Unlike exceptions, wordforms are applied after tokenizing the incoming text. So they:
- are case insensitive (unless your charset_table enables case sensitivity)
- don’t care about special characters

They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form. For example, to normalize all the variants such as “walks”, “walked”, “walking” to the normal form “walk”:

➜ ~ cat /tmp/wordforms
walks => walk
walked => walk
walking => walk

mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('walks _WaLkeD! walking', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | walks     | walk       |
| 2    | walked    | walk       |
| 3    | walking   | walk       |
+------+-----------+------------+
3 rows in set (0.00 sec)

As you can see, all 3 words were converted to just walk and, note, the 2nd word _WaLkeD!, even though it was badly deformed, was also normalized fine.

Do you see where I’m going with this? Yes, the MS Windows example. Let’s test whether wordforms can help solve that issue.

Let’s put just 2 lines into the wordforms file:

➜ ~ cat /tmp/wordforms
ms windows => ms windows
microsoft windows => ms windows

and populate the table with a few documents:

mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems'), (0, 'porch windows'),(0, 'Windows are rolled down');
mysql> select * from t;
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
| 1514841286668976167 | porch windows                                           |
| 1514841286668976168 | Windows are rolled down                                 |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)

Let’s now try different queries:

mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

✅ MS Windows finds Microsoft Windows fine.

mysql> select * from t where match('ms WINDOWS');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.01 sec)

✅ ms WINDOWS works fine too.

mysql> select * from t where match('mIcRoSoFt WiNdOwS');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

✅ And even mIcRoSoFt WiNdOwS finds the same document.

mysql> select * from t where match('windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
| 1514841286668976167 | porch windows                                           |
| 1514841286668976168 | Windows are rolled down                                 |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)

✅ Just basic windows finds all the documents.

So indeed, wordforms helps to solve the issue.

The rule of thumb with wordforms is: use wordforms for words and phrases that can be written in different forms and don’t contain special characters.

Floor & Decor

Let’s take a look at another example: we want to improve search for the brand name Floor & Decor. We can assume people can write this name in the following forms:
- Floor & Decor
- Floor & decor
- floor & decor
- Floor and Decor
- floor and decor

and other letter capitalization combinations.
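To get a feel for how many "other letter capitalization combinations" there are, here is a small Python sketch. It is not part of Manticore; the helper names case_variants and variant_count are made up for illustration. It counts how many byte-exact spellings a case-sensitive list of exceptions would need for a single ASCII phrase.

```python
from itertools import product


def case_variants(term):
    """Return every upper/lower-case spelling of an ASCII term."""
    choices = [
        (ch.lower(), ch.upper()) if ch.isalpha() else (ch,)
        for ch in term
    ]
    return ["".join(combo) for combo in product(*choices)]


def variant_count(term):
    """Each letter doubles the number of spellings: 2 ** letters."""
    letters = sum(ch.isalpha() for ch in term)
    return 2 ** letters


if __name__ == "__main__":
    print(case_variants("ms"))               # ['ms', 'mS', 'Ms', 'MS']
    print(variant_count("floor & decor"))    # 1024
    print(variant_count("floor and decor"))  # 8192
```

Even before extra spaces or longer forms of the name are taken into account, one phrase already needs thousands of lines if every capitalization has to be listed explicitly.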
People may also write the extended forms:
- Floor & Decor Holdings
- Floor & Decor Holdings, inc.

and, again, various combinations with different letters capitalized.

Now that we know how exceptions and wordforms work, what do we do to cover this brand name?

First of all, we can easily notice that the canonical brand name is Floor & Decor, i.e. it includes a special character which is normally considered a word separator, so should we use exceptions? But the name is long and can be written in many ways. If we use exceptions we can end up with a huge list of all the combinations. Moreover, there are the extended forms Floor & Decor Holdings and Floor & Decor Holdings, inc., which can make the list even longer.

The optimal solution in this case is to just use wordforms like this:

➜ ~ cat /tmp/wordforms
floor & decor => fnd
floor and decor => fnd
floor & decor holdings => fnd
floor and decor holdings => fnd
floor & decor holdings inc => fnd
floor and decor holdings inc => fnd

Why does it include &? Actually you can skip it:

floor decor => fnd
floor and decor => fnd
floor decor holdings => fnd
floor and decor holdings => fnd
floor decor holdings inc => fnd
floor and decor holdings inc => fnd

because wordforms ignores non-word characters anyway, but it was left in just for ease of reading.

As a result you’ll get each combination tokenized as fnd, which will be our short key for this brand name.

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('Floor & Decor', 't');
+------+-------------+------------+
| qpos | tokenized   | normalized |
+------+-------------+------------+
| 1    | floor decor | fnd        |
+------+-------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('floor and Decor', 't');
+------+-----------------+------------+
| qpos | tokenized       | normalized |
+------+-----------------+------------+
| 1    | floor and decor | fnd        |
+------+-----------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('Floor & Decor holdings', 't');
+------+----------------------+------------+
| qpos | tokenized            | normalized |
+------+----------------------+------------+
| 1    | floor decor holdings | fnd        |
+------+----------------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('Floor & Decor HOLDINGS INC.', 't');
+------+--------------------------+------------+
| qpos | tokenized                | normalized |
+------+--------------------------+------------+
| 1    | floor decor holdings inc | fnd        |
+------+--------------------------+------------+
1 row in set (0.00 sec)

Is this the perfect, ultimate solution? Unfortunately not, as with many other things in the area of full-text search. There are always rare edge cases, and this case has one too. For example:

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0,'It\'s located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');
+---------------------+---------------------------------------------------+
| id                  | f                                                 |
+---------------------+---------------------------------------------------+
| 1514841286668976231 | It's located on the 2nd floor. Decor is also nice |
+---------------------+---------------------------------------------------+
1 row in set (0.00 sec)

We can see here that Floor & Decor Holdings finds the document which has floor at the end of the first sentence, while the following sentence starts with Decor. This happens because floor. Decor also gets tokenized to fnd, since we use just wordforms, which are insensitive to letter case and special characters:

mysql> call keywords('floor. Decor', 't');
+------+-------------+------------+
| qpos | tokenized   | normalized |
+------+-------------+------------+
| 1    | floor decor | fnd        |
+------+-------------+------------+
1 row in set (0.00 sec)

The false match is not good. To solve this particular problem we can use Manticore’s functionality to detect sentences and paragraphs.

Now if we enable it we can see that the document is not a match for the keyword any more:

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms' index_sp='1';
mysql> insert into t values(0,'It\'s located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');
Empty set (0.00 sec)

because:
- Floor & Decor, as we remember, is converted into fnd by wordforms
- index_sp='1' splits the text into sentences
- after splitting, floor. and Decor end up in different sentences
- so they no longer match fnd, and therefore none of its original forms match either

Conclusion

Manticore’s exceptions and wordforms are powerful tools that can help you fine-tune your search, in particular to improve recall and precision when it comes to short terms with special characters that should be retained and longer terms that should be aliased to one another. But you need to help Manticore do it, since it can’t decide for you what the names should be.

Thank you for reading this article!
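As a recap of the order of operations described in this article, here is a toy Python model. It is only a sketch under simplifying assumptions and not how Manticore is implemented: the dictionaries EXCEPTIONS and WORDFORMS and the functions tokenize, apply_wordforms and normalize are invented for illustration, word characters are limited to ASCII letters, digits and underscore, and sentence detection (index_sp) is not modelled.

```python
import re

# Exceptions: byte-exact source strings, matched BEFORE tokenization.
EXCEPTIONS = {"c++": "c++", "C++": "c++"}

# Wordforms: sequences of already tokenized (lowercased) words,
# matched AFTER tokenization.
WORDFORMS = {
    ("floor", "decor"): "fnd",
    ("floor", "and", "decor"): "fnd",
    ("walked",): "walk",
}

WORD = re.compile(r"[A-Za-z0-9_]+")


def tokenize(text, exceptions):
    """Split text into lowercase tokens, keeping exception matches intact."""
    tokens = []
    pos = 0
    while pos < len(text):
        for source, target in exceptions.items():
            end = pos + len(source)
            # Exact, case-sensitive match that is not followed by a word character.
            if text.startswith(source, pos) and not WORD.match(text, end):
                tokens.append(target)
                pos = end
                break
        else:
            match = WORD.match(text, pos)
            if match:
                tokens.append(match.group().lower())
                pos = match.end()
            else:
                pos += 1  # separator: drop it
    return tokens


def apply_wordforms(tokens, wordforms):
    """Replace the longest matching token sequence with its normal form."""
    longest = max((len(key) for key in wordforms), default=0)
    result = []
    i = 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + size])
            if key in wordforms:
                result.append(wordforms[key])
                i += size
                break
        else:
            result.append(tokens[i])
            i += 1
    return result


def normalize(text):
    return apply_wordforms(tokenize(text, EXCEPTIONS), WORDFORMS)


if __name__ == "__main__":
    print(normalize("Is C++ the most powerful language?"))
    print(normalize("What is 2+2?"))
    print(normalize("FLOOR and decor"))
    print(normalize("on the 2nd floor. Decor is also nice"))
```

In this model the last call also produces fnd for "floor. Decor", which mirrors the false match discussed above: once punctuation is dropped, nothing distinguishes the brand name from two neighbouring words in different sentences.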
References:
- Documentation about exceptions: https://manual.manticoresearch.com/Creating_an_index/NLP_and_tokenization/Exceptions
- Documentation about wordforms: https://manual.manticoresearch.com/Creating_an_index/NLP_and_tokenization/Wordforms