Five years ago we had two bags of grass, seventy-five pellets of mescaline, three C++ developers, a support engineer, a power user of Sphinx Search / backend team lead, an experienced manager, a mother of five helping us part-time, and a ton of bugs, crashes, and technical debt. So we got a shovel and other digging tools and started working to bring it up to search engine industry standards. Not that Sphinx was impossible to use, but many things were missing, and existing features weren’t quite stable or mature. And we had pushed it about as far as we could. So after 5 years and hundreds of new users, we’re ready to say that Manticore Search can be used as an alternative to Elasticsearch for both full-text search and (now) data analytics too.
In this article, I want to:
⭐⭐⭐ Your star on GitHub supports the project and makes us think we are on the right path! ⭐⭐⭐
In 2001, the first Apple Store opened, and Windows XP, iTunes, and Mac OS X were released.
The genius Andrey Aksyonoff started working on Sphinx Search, for which I want to thank him very much! There was no Solr or Elasticsearch yet, but there was already Lucene, on which they were both subsequently built. Sphinx Search slowly started coming together, and within a few years it became quite a popular technology, used by thousands of websites.
In 2010, the Retina display, systemd, the iPad, and Elasticsearch appeared.
By this time Sphinx was already a popular full-text search engine, but Sphinx’s concept of “source data has to be somewhere and we just make a full-text index that needs to be rebuilt regularly” was not as attractive as Elasticsearch’s “give me any JSON via HTTP in real-time, I will find a node to place it on”. Solr wasn’t very good with data distribution, and JSON was gaining popularity while XML was losing its appeal. Soon Elasticsearch started to gain popularity rapidly.
As a result, frustrated users and some former Sphinx developers teamed up and built a fork: Manticore Search. Our primary goals were as follows:
“Okay, who wants to find out if this thing works?”
🙁 Sphinx 2: The main use case is indexing data from an external database: Sphinx returns an id, and you then have to go to the database and fetch the source document by that id. The data schema can only be declared in the config.
✅ Manticore: The basic way to work with it is exactly the same as in MySQL / Postgres and Elasticsearch: a table can be created on the fly, data can be modified by a single/bulk INSERT/REPLACE/DELETE query, the data gets automatically compacted in the background. There is no need to look up the original document in an external source. Auto ID supported.
🙁 Sphinx 2: No replication.
✅ Manticore: Replication based on Galera, which is also used by MariaDB and Percona Server.
🙁 Sphinx 2: Queries can be done via SQL (MySQL wire protocol) or Sphinx binary protocol, there are clients for a few programming languages.
✅ Manticore: Added a JSON interface very similar to Elasticsearch’s. Based on the new protocol, new clients for PHP, Python, Java, JavaScript, and Elixir were built. The clients are generated automatically, so new functionality becomes available in the clients soon after it appears in the engine.
🙁 Sphinx 2: Difficult to configure text tokenization for most languages
✅ Manticore: Simplified: made aliases cjk and non_cjk. Made tokenization of Chinese based on ICU. Added many new stemmers, including Ukrainian.
🙁 Sphinx 2: No official docker image and no support in the Kubernetes ecosystem
✅ Manticore: Made and support
🙁 Sphinx 2: No APT/YUM/Homebrew repositories
✅ Manticore: Added
🙁 Sphinx 2: Novice users had a hard time understanding what’s what.
✅ Manticore: Made platform with interactive courses —
🙁 Sphinx 2: Few examples in the documentation
✅ Manticore: rewrote documentation, made our own rendering engine for it -
🙁 Sphinx 2: Bugs that often lead to crashes
✅ Manticore: Crashes are now rare. Hundreds of old bugs have been fixed.
🙁 Sphinx 2: Running search queries in parallel is limited
✅ Manticore: Migrated to
🙁 Sphinx 2: Cannot be used without full-text fields
✅ Manticore: Can be used without full-text, like any other database.
🙁 Sphinx 2: Non-full-text data is stored row-wise, it must be in memory to work efficiently.
✅ Manticore: Implemented and open-sourced
🙁 Sphinx 2: No secondary indexes
✅ Manticore: The second important functionality of Manticore Columnar Library is support for secondary indexes based on the modern and innovative
🙁 Sphinx 2: No percolate indexes for reverse search (when there are queries in the index and documents are used as input to find out which queries would match them)
✅ Manticore: Added percolate type indexes.
This is only about a third of the changes: the ones you can easily see. On top of that, there have been many months of refactoring different parts of the system, resulting in much simpler, more reliable, and faster code. We hope this will attract new developers to the project.
Elasticsearch is fine: it’s not very hard to use up to a certain amount of data, there’s replication, fault tolerance, and rich functionality. But there are nuances.
Let’s take a look at those nuances and what Manticore is like compared to Elasticsearch now (July 2022). Future reader, we’ve already bolted something else on, check out our
Performance, namely low response time, is important in many cases, especially in log and data analytics, where there is a lot of data and not many search queries. You don’t want to wait 30 seconds instead of two for a response, do you? So here are the nuances: Elasticsearch is considered a standard for log management, but, for example, it can’t effectively parallelize a query to a single index shard. Elasticsearch has only 1 shard by default, while modern servers have many more CPU cores than that. Making too many shards is also bad. None of this makes life easier for a DevOps engineer who cares about response time: you have to think about what hardware Elasticsearch will run on and adjust the settings accordingly.
Manticore, on the contrary, is able to parallelize the search query to all CPU cores unconditionally and by default. It would be more correct to say that Manticore itself decides when to parallelize and when not, but in most cases it does, which allows you to efficiently load the CPU cores (which are often idle in cases of logging and data analytics) and significantly reduce response time.
But even if you make as many shards in Elasticsearch as there are CPU cores on the server, Manticore turns out to be significantly faster, specifically: here’s a test for 1.7 billion documents, from which you can see that overall Manticore is 4 times faster than Elasticsearch. If you are interested in the details or want to reproduce that on your own hardware, here is an article
Here is a different case: no big data, just 1.1 million comments from Hacker News. In this test, Manticore is 15x faster than Elasticsearch.
And another test, representative of Elasticsearch’s role as a standard log analytics tool: 10 million Nginx log records and various quite realistic analytical queries. Manticore is 22 times faster than Elasticsearch here.
There are also nuances with Elasticsearch’s write speed. For example, the dataset for the 1.7 billion-document test discussed above was loaded:
This was on a 32-core server with SSD. The amounts of data after indexing are about the same. To learn more about how exactly the load was handled
In brief:
indexer to put data into 32 shards in parallel.
Here is the log of the data loading to Elasticsearch and Manticore: https://gist.github.com/sanikolaev/678dd862a7668921e3417321be0a2513
It turns out that in this test Manticore is 25 times faster in terms of data ingestion. Maybe I don’t know how to bake Logstash and Elasticsearch, but the import of the same dataset (but of a slightly smaller size) took
Maybe the problem is in Logstash, not Elasticsearch? Let’s find out by writing directly to Elasticsearch. The index schema is as follows:
"properties": {
"name": {"type": "text"},
"email": {"type": "keyword"},
"description": {"type": "text"},
"age": {"type": "integer"},
"active": {"type": "integer"}
}
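For reference, here is a minimal sketch of how the fragment above could be wrapped into a complete create-index request body for Elasticsearch. The index name `user` comes from the queries used later in this article; the helper name `build_mapping_body` is hypothetical, and nothing is sent over the network here.

```python
import json

# Field types copied from the index schema above.
FIELDS = {
    "name": "text",
    "email": "keyword",
    "description": "text",
    "age": "integer",
    "active": "integer",
}


def build_mapping_body(fields):
    """Return a JSON-serializable body for an Elasticsearch create-index request."""
    properties = {name: {"type": ftype} for name, ftype in fields.items()}
    return {"mappings": {"properties": properties}}


body = build_mapping_body(FIELDS)
print(json.dumps(body, indent=2))
```

Such a body would typically be sent with `PUT /user` to the Elasticsearch HTTP port (9200 in the docker command used in this test).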
We start Manticore and Elasticsearch using their official Docker images like this:
docker run --name manticore --rm -p 9308:9308 -v $(pwd)/manticore_idx:/var/lib/manticore manticoresearch/manticore:5.0.2
docker run --name elasticsearch --rm -p 9200:9200 -e discovery.type=single-node -e xpack.security.enabled=false -v $(pwd)/es_idx/:/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:8.3.2
Let’s now put 50 million random docs like this to both:
{
1,
84,
"Aut corporis qui necessitatibus architecto est. Harum laboriosam temporibus praesentium quis et nulla. Consequuntur quia neque et repellat.",
"[email protected]",
"Keely Doyle Sr."
}
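To give an idea of what loading such documents involves, here is a minimal Python sketch that generates random users and packs them into an Elasticsearch-style `_bulk` body (one action line followed by one source line per document). It is not the PHP loader used in this test: the function names are hypothetical, the random text is much simpler than the sample above, and the assignment of the sample values to the fields `active`, `age`, `description`, `email`, and `name` is an assumption based on the schema.

```python
import json
import random
import string


def random_word(rng, length=8):
    """Return a random lowercase word."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def random_user(rng):
    """Build one fake user document matching the schema fields."""
    name = random_word(rng).capitalize() + " " + random_word(rng).capitalize()
    return {
        "name": name,
        "email": random_word(rng) + "@example.com",
        "description": " ".join(random_word(rng) for _ in range(12)),
        "age": rng.randint(18, 90),
        "active": rng.randint(0, 1),
    }


def bulk_body(docs, index="user"):
    """Serialize documents as NDJSON: an action line, then the document itself."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a bulk body has to end with a newline


rng = random.Random(42)  # fixed seed so the batch is reproducible
batch = [random_user(rng) for _ in range(3)]
payload = bulk_body(batch)
print(payload)
```

In a real loader, each batch would hold thousands of documents and several workers would send batches concurrently.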
We’ll use
root@perf3 ~ # php load_elasticsearch.php 10000 32 1000000 50
preparing...
found in cache
querying...
finished inserting
Total time: 178.24096798897
280519 docs per sec
root@perf3 ~ # php load_manticore.php 10000 32 1000000 50
preparing...
found in cache
querying...
finished inserting
Total time: 215.7572619915
231742 docs per sec
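OK, now Elastic is 21% faster, but again there is an interesting nuance: by default, Elasticsearch makes newly written documents searchable only after about a second. Let’s force a refresh on every batch with /bulk?refresh=1 in Elasticsearch and see what it gives: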
root@perf3 ~ # php load_elasticsearch.php 10000 32 1000000 50
preparing...
found in cache
querying...
finished inserting
Total time: 307.47588610649
162614 docs per sec
In this case Manticore is again faster by 43%.
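The percentages quoted for these three runs follow directly from the total times printed by the loaders. Here is a small sketch of the arithmetic; the variable and function names are mine, and the times are copied from the output above.

```python
TOTAL_DOCS = 50_000_000

# Total times in seconds, copied from the loader output above.
runs = {
    "elasticsearch_default": 178.24096798897,
    "manticore": 215.7572619915,
    "elasticsearch_refresh": 307.47588610649,
}

# Throughput of each run in documents per second.
rates = {name: TOTAL_DOCS / seconds for name, seconds in runs.items()}


def faster_by_percent(a, b):
    """How much faster rate `a` is than rate `b`, in percent."""
    return (a / b - 1) * 100


for name, rate in rates.items():
    print(f"{name}: {rate:.0f} docs per sec")

es_vs_manticore = faster_by_percent(rates["elasticsearch_default"], rates["manticore"])
manticore_vs_es = faster_by_percent(rates["manticore"], rates["elasticsearch_refresh"])
print(f"Elasticsearch (default refresh) vs Manticore: {es_vs_manticore:.0f}% faster")
print(f"Manticore vs Elasticsearch (refresh per batch): {manticore_vs_es:.0f}% faster")
```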
If we want to test the maximum performance, we can:
Here’s what it gives:
Manticore:
// docker run -p9306:9306 --name manticore --rm -v $(pwd)/manticore_idx:/var/lib/manticore -e searchd_binlog_path= manticoresearch/manticore:5.0.2
root@perf3 ~ # php load_manticore_sharded.php 10000 32 1000000 32 50
preparing...
found in cache /tmp/bc9719fb0d26e18fc53d6d5aaaf847b4_10000_1000000
querying...
finished inserting
Total time: 55.874907970428
894856 docs per sec
Elasticsearch:
root@perf3 ~ # php load_elasticsearch_sharded.php 10000 32 1000000 32 50
preparing...
found in cache
querying...
finished inserting
Total time: 119.96515393257
416788 docs per sec
But remember the nuance: you have to spend another 13 seconds to make the documents searchable:
root@perf3 ~ # curl -s -X POST "localhost:9200/_sql?format=json&pretty" -H 'Content-Type: application/json' -d'{"query": "select count(*) from user"}'
{
"columns" : [
{
"name" : "count(*)",
"type" : "long"
}
],
"rows" : [
[
0
]
]
}
root@perf3 ~ # time curl -XPOST "localhost:9200/user/_refresh"
{"_shards":{"total":64,"successful":32,"failed":0}}
real 0m13.505s
user 0m0.003s
sys 0m0.000s
root@perf3 ~ # curl -s -X POST "localhost:9200/_sql?format=json&pretty" -H 'Content-Type: application/json' -d'{"query": "select count(*) from user"}'
{
"columns" : [
{
"name" : "count(*)",
"type" : "long"
}
],
"rows" : [
[
50000000
]
]
}
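To see how the refresh affects the comparison, here is a rough calculation based on the numbers above. It adds the `real` time of the `_refresh` call to the Elasticsearch load time; the variable names are mine.

```python
TOTAL_DOCS = 50_000_000

manticore_load_s = 55.874907970428   # sharded Manticore run
elastic_load_s = 119.96515393257     # sharded Elasticsearch run
elastic_refresh_s = 13.505           # `real` time of the _refresh call

manticore_rate = TOTAL_DOCS / manticore_load_s
elastic_rate = TOTAL_DOCS / elastic_load_s
# Throughput counted up to the moment the documents become searchable.
elastic_searchable_rate = TOTAL_DOCS / (elastic_load_s + elastic_refresh_s)

load_only_ratio = manticore_rate / elastic_rate
searchable_ratio = manticore_rate / elastic_searchable_rate

print(f"Manticore vs Elasticsearch, load only: {load_only_ratio:.2f}x")
print(f"Manticore vs Elasticsearch, until searchable: {searchable_ratio:.2f}x")
```

Both ratios come out a little above 2, with the second one higher because of the extra refresh time.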
All in all, Manticore is 2x faster than Elasticsearch in terms of data ingestion performance. And the data is searchable immediately after the batch is loaded, not 2 minutes later. The scripts used for this test can be found
Both Elasticsearch and Manticore can do both SQL and JSON, but the difference is:
{ and } brackets or … ?
SELECT id
INSERT/UPDATE/DELETE
In some cases, you need to be able to launch a service quickly. For example, in IoT (Internet of things) or some ETL scenarios.
As mentioned above, by default, when you put data into Elasticsearch, it becomes searchable only after a second. This can be adjusted, but then the ingestion rate becomes significantly slower, as you can see above.
Manticore always works in real-time mode.
Probably worth another article to explain it all. In short: both Manticore and Elasticsearch are good in terms of full-text search, have a lot in common, but there are a lot of differences, too. According to
Both Manticore and Elasticsearch provide rich aggregation functionality. You probably know what Elasticsearch can do; here’s what can be done in Manticore, for comparison:
Just grouping: SELECT release_year FROM films GROUP BY release_year LIMIT 5
Get aggregates: SELECT release_year, AVG(rental_rate) FROM films GROUP BY release_year LIMIT 5
Sort buckets: SELECT release_year, count(*) from films GROUP BY release_year ORDER BY release_year asc limit 5
Group by multiple fields at the same time: SELECT category_id, release_year, count(*) FROM films GROUP BY category_id, release_year ORDER BY category_id ASC, release_year ASC
Get N records from each bucket, not 1: SELECT release_year, title FROM films GROUP 2 BY release_year ORDER BY release_year DESC LIMIT 6
Sort inside a bucket: SELECT release_year, title, rental_rate FROM films GROUP BY release_year WITHIN GROUP ORDER BY rental_rate DESC ORDER BY release_year DESC LIMIT 5
Filter buckets: SELECT release_year, avg(rental_rate) avg FROM films GROUP BY release_year HAVING avg > 3
Use GROUPBY() to access aggregation key: SELECT release_year, count(*) FROM films GROUP BY release_year HAVING GROUPBY() IN (2000, 2002)
Group by array value: SELECT groupby() gb, count(*) FROM shoes GROUP BY sizes ORDER BY gb asc
Group by json node: SELECT groupby() color, count(*) from products GROUP BY meta.color
Get count of distinct values: SELECT major, count(*), count(distinct age) FROM students GROUP BY major
Use GROUP_CONCAT(): SELECT major, count(*), count(distinct age), group_concat(age) FROM students GROUP BY major
Use FACET after your main query and it will group the main query’s results: SELECT *, price AS aprice FROM facetdemo LIMIT 10 FACET price LIMIT 10 FACET brand_id LIMIT 5
Faceting by aggregation over another attribute: SELECT * FROM facetdemo FACET brand_name by brand_id
Faceting without duplicates: SELECT brand_name, property FROM facetdemo FACET brand_name distinct property
Facet over expressions: SELECT * FROM facetdemo FACET INTERVAL(price,200,400,600,800) AS price_range
Facet over multi-level grouping: SELECT *,INTERVAL(price,200,400,600,800) AS price_range FROM facetdemo FACET price_range AS price_range, brand_name ORDER BY brand_name asc;
Sorting of facet results:
SELECT * FROM facetdemo
FACET brand_name BY brand_id ORDER BY FACET() ASC
FACET brand_name BY brand_id ORDER BY brand_name ASC
FACET brand_name BY brand_id ORDER BY COUNT(*) DESC
Pagination in facet results:
SELECT * FROM facetdemo
FACET brand_name BY brand_id ORDER BY FACET() ASC LIMIT 0,1
FACET brand_name BY brand_id ORDER BY brand_name ASC LIMIT 2,4
FACET brand_name BY brand_id ORDER BY COUNT(*) DESC LIMIT 4;
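The `FACET INTERVAL(price,200,400,600,800)` example above may be the least obvious one in the list, so here is a small Python emulation of what it computes. As I understand the function, `INTERVAL()` returns 0 if the value is below the first point, 1 if it is between the first and second points, and so on. The sample prices are made up, and this sketch only mimics the bucketing, not how Manticore executes the query.

```python
from bisect import bisect_right
from collections import Counter


def interval(value, *points):
    """Emulate INTERVAL(): index of the range that `value` falls into.

    Returns 0 if value < points[0], 1 if points[0] <= value < points[1], etc.
    `points` must be sorted in ascending order.
    """
    return bisect_right(points, value)


prices = [150, 200, 399, 400, 650, 799, 800, 1200]  # made-up sample data

# Count documents per price range, like FACET ... AS price_range would.
facet = Counter(interval(price, 200, 400, 600, 800) for price in prices)
for bucket in sorted(facet):
    print(bucket, facet[bucket])
```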
Elasticsearch is famous for the fact that you can write anything into it. With Manticore Search, you have to create a schema beforehand. Many Elasticsearch experts recommend using static mapping, for example,
One of the very first things you can do is to define your indice mapping statically.
But we find dynamic mapping important in the area of log management and analysis. Since we want Manticore to be easy to use for that, we plan to enable dynamic mapping in Manticore, too.
curl, because the commands are much more compact and the session is supported.
# download manticore beta version with support for Kibana, check https://repo.manticoresearch.com/repository/kibana_beta/ for different OS versions
wget https://repo.manticoresearch.com/repository/kibana_beta/ubuntu/jammy.zip
# unarchive it
unzip jammy.zip
# install the packages
dpkg -i build/*
# switch Manticore to the mode supporting Kibana
mysql -P9306 -h0 -e "set global log_management = 0; set global log_management = 1;"
# start Kibana pointing it to Manticore Search instance listening on port 9308
docker run -d --name kibana --rm -e ELASTICSEARCH_HOSTS=http://127.0.0.1:9308 -p 5601:5601 --network=host docker.elastic.co/kibana/kibana:7.4.2
# install php and composer, download loading script and put into Manticore 1 million docs of fake users
apt install php composer php8.1-mysql
wget https://gist.githubusercontent.com/sanikolaev/13bf61bbe6c39350bded7c577216435f/raw/8d8029c0d99998c901973fd9ac66a6fb920deda7/load_manticore_sharded.php
composer require fakerphp/faker
php load_manticore_sharded.php 10000 16 1000000 16 1
# don't forget to create an index pattern in Kibana (user*)
# run `docker stop kibana` to stop the Kibana server
If all went well you should see:
Unlike Elasticsearch, Manticore does not yet have automatic sharding, but combining multiple indexes into one for manual sharding is easier than in Elasticsearch:
Adding an index located on a remote node is also supported, just specify the remote host, port, and index name.
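As a rough illustration, a statement for such a combined table could be assembled like this. The `CREATE TABLE ... type='distributed'` form with `local` and `agent` options is how I understand the Manticore syntax, but please check it against the documentation for the version you run. The table names, the host, the port, and the helper function are all hypothetical.

```python
def distributed_table_sql(name, local_tables=(), agents=()):
    """Build a CREATE TABLE statement for a distributed table.

    `local_tables` are names of tables on the same node;
    `agents` are (host, port, remote_table) tuples for remote tables.
    """
    options = ["type='distributed'"]
    options += [f"local='{table}'" for table in local_tables]
    options += [f"agent='{host}:{port}:{table}'" for host, port, table in agents]
    return f"CREATE TABLE {name} " + " ".join(options)


sql = distributed_table_sql(
    "user_all",
    local_tables=["user1", "user2"],
    agents=[("remote-host", 9312, "user3")],
)
print(sql)
```

The resulting statement can then be run through any SQL client connected to Manticore, and queries against `user_all` would cover all the listed shards.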
Our thinking is that our users, be they developers or DevOps engineers, shouldn’t have to become experts in databases or search engines, or have a PhD, to be able to use Manticore products. We assume you have better things to do than spend hours trying to understand how this or that setting affects this or that functionality. Hence, Manticore Search should work fine in most cases even with the defaults.
Our ultimate goal is to make Manticore Search as easy to use and learn as possible.
mnt.cr/<keyword> takes you to the search results in the documentation in a special mode: it immediately scrolls to the most relevant section. This is especially handy when you need to recall the details of some setting, e.g.
In Elasticsearch, most things are only done through the API. There is no way (
Manticore, like Kubernetes, supports two usage modes:
CREATE TABLE/DROP TABLE/ALTER TABLE, CREATE CLUSTER/JOIN CLUSTER/DELETE CLUSTER etc.
Percolate, or persistent query, search is when a table contains queries rather than documents, and documents are used as the search input. The search results are the stored queries that match the given documents. This type of search is useful for user subscriptions: if you subscribed, for example, to the query TV > 42 inches, then as soon as a matching item appears on the site, you will be notified about it. Both Manticore and Elasticsearch provide this functionality. According to the
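To make the idea of reverse search more concrete, here is a purely conceptual Python sketch. The stored "queries" are plain predicates and the matching is a simple loop; this is neither Manticore's nor Elasticsearch's API, and all names in it are made up.

```python
# Stored "queries": each one is a name plus a predicate over a document.
stored_queries = {
    "tv_over_42_inches": lambda doc: doc["category"] == "TV" and doc["inches"] > 42,
    "any_tv": lambda doc: doc["category"] == "TV",
    "cheap_items": lambda doc: doc["price"] < 100,
}


def percolate(doc, queries):
    """Return the names of all stored queries that match the document."""
    return sorted(name for name, matches in queries.items() if matches(doc))


# A new item appears on the site; find which subscriptions it triggers.
new_product = {"category": "TV", "inches": 55, "price": 499}
matched = percolate(new_product, stored_queries)
print(matched)
```

Every subscriber whose query is in the result would then be notified about the new item.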
We are now developing the project in the following directions:
So, at the end of it all, what do we have? Manticore may now be of interest to those:
We are continuing!