Manticore is a Faster Alternative to Elasticsearch in C++

Five years ago Manticore began as a fork of an open source version of the once popular search engine Sphinx Search. We had two bags of grass, seventy-five pellets of mescaline, three C++ developers, a support engineer, a power user of Sphinx Search / backend team lead, an experienced manager, a mother of five helping us part-time, and a ton of bugs, crashes, and technical debt. So we got a shovel and other digging tools and started working to bring it up to search engine industry standards. Not that Sphinx was impossible to use, but many things were missing, and existing features weren’t quite stable or mature. And we had pushed it about as far as we could. So after 5 years and hundreds of new users, we’re ready to say that Manticore Search can be used as an alternative to Elasticsearch for both full-text search and (now) data analytics too.

In this article, I want to:

- Talk about the search engine era before SOLR and Elasticsearch
- Clearly depict the current situation with Manticore Search vs Elasticsearch
- Try to understand where we should head next

⭐⭐⭐ Your star on GitHub supports the project and makes us think we are on the right path! ⭐⭐⭐

A little of the history

2001 - just Lucene and Sphinx

The first Apple store opened; Windows XP, iTunes and Mac OS X were released. The genius Andrey Aksyonoff started working on Sphinx Search, for which I want to thank him very much! There was no SOLR or Elasticsearch yet, but there was already Lucene, on which they were both subsequently built. Sphinx Search slowly started coming together, and in a few years became quite a popular technology, used by thousands of websites.

2010 - Elasticsearch appeared

Retina display, systemd, iPad, and Elasticsearch appeared.
By this time Sphinx was already a popular full-text search engine, but Sphinx’s concept of “source data has to be somewhere and we just make a full-text index that needs to be rebuilt regularly” was not as interesting as Elasticsearch’s “give me any JSON via HTTP in real-time, I will find a node to place it on”. SOLR wasn’t very good with data distribution, and JSON was gaining popularity, while XML was losing its attraction. Soon Elasticsearch started to rapidly gain popularity.

2017 - Manticore appeared

Elastic had firmly established itself as a standard tool for full-text search and for log and data analytics. Sphinx ceased its development as an open source project. Development, in general, slowed down significantly, and for some time was completely suspended. Many Sphinx users who loved it and knew how to deal with it were not pleased about this, and it was painful for them to migrate to Elasticsearch. In addition, by then, Elasticsearch’s conceptual flaws had surfaced: excessive memory consumption, difficulty in maintaining large clusters, and some performance issues. As a result, the frustrated users and some former Sphinx developers teamed up and built a fork: Manticore Search. Our primary goals were as follows:

- Continue developing the project as open source
- Look at everything from just a regular everyday normal user’s point of view and add the functionality they need in today’s environment
- Strengthen Sphinx’s strong sides and eliminate obvious weaknesses

2022: Five more years later

“Okay, who wants to find out if this thing works?”

🙁 Sphinx 2: The main use case is indexing data from an external database: Sphinx returns an id, then by that id you have to go to the database and look up the source document there. The data schema can only be declared in the config.
✅ Manticore: The basic way to work with it is exactly the same as in MySQL / Postgres and Elasticsearch: a table can be created on the fly, data can be modified by a single/bulk INSERT/REPLACE/DELETE query, and the data gets automatically compacted in the background. There is no need to look up the original document in an external source. Auto ID is supported.

🙁 Sphinx 2: No replication.
✅ Manticore: Replication based on Galera, which is also used by MariaDB and Percona XtraDB Cluster.

🙁 Sphinx 2: Queries can be done via SQL (MySQL wire protocol) or the Sphinx binary protocol; there are clients for a few programming languages.
✅ Manticore: Added a JSON interface very similar to Elasticsearch’s. Based on the new protocol, new clients for PHP, Python, Java, JavaScript, and Elixir were built. The clients are generated automatically, making new functionality available in the client sooner after it appears in the engine.

🙁 Sphinx 2: Difficult to configure text tokenization for most languages.
✅ Manticore: Simplified: made the aliases cjk and non_cjk. Made tokenization of Chinese based on ICU. Added many new stemmers, including Ukrainian.

🙁 Sphinx 2: No official Docker image and no support in the Kubernetes ecosystem.
✅ Manticore: Made an official Docker image and a Helm chart for Kubernetes.

🙁 Sphinx 2: No APT/YUM/Homebrew repositories.
✅ Manticore: Added APT/YUM/Homebrew repositories. Nightly builds are also available in the development repository. Each new commit becomes available as a package.

🙁 Sphinx 2: Novice users had a hard time understanding what’s what.
✅ Manticore: Made https://play.manticoresearch.com/, a platform with interactive courses.

🙁 Sphinx 2: Few examples in the documentation.
✅ Manticore: Rewrote the documentation and made our own rendering engine for it: https://manual.manticoresearch.com/. It’s also available in a simple markdown format for contributions and easy editing.

🙁 Sphinx 2: Bugs that often lead to crashes.
✅ Manticore: Hundreds of old bugs have been fixed. Crashes are now rare.

🙁 Sphinx 2: Running search queries in parallel is limited.
✅ Manticore: Migrated to coroutines. Made it possible to parallelize any search query, so as to fully load the CPU and reduce the response time to a minimum.

🙁 Sphinx 2: Cannot be used without full-text fields.
✅ Manticore: Can be used without full-text, like any other database.

🙁 Sphinx 2: Non-full-text data is stored row-wise and must be in memory to work efficiently.
✅ Manticore: Implemented and open-sourced Manticore Columnar Library, an external fully independent library that allows storing data column-oriented in blocks, with support for different codecs for compressing different types of data efficiently. Requires almost no memory. You can now handle much larger amounts of data on the same server.

🙁 Sphinx 2: No secondary indexes.
✅ Manticore: The second important functionality of Manticore Columnar Library is support for secondary indexes, based on the modern and innovative PGM algorithm.

🙁 Sphinx 2: No percolate indexes for reverse search (when there are queries in the index and documents are used as input to find out which queries would match them).
✅ Manticore: Added percolate type indexes.

This is only about a third of the changes: the ones you can easily see. On top of that, there have been many months of refactoring different parts of the system, resulting in much simpler, more reliable, and more performant code. We hope this will attract new developers to the project.

What about Elasticsearch?

Elasticsearch is fine: it’s not very hard to use up to a certain amount of data, there’s replication, fault tolerance, and rich functionality. But there are nuances.

Let’s take a look at those nuances and at how Manticore compares to Elasticsearch now (July 2022). Future reader, we’ve already bolted something else on; check out our Changelog.

Search Speed

Performance, namely low response time, is important in many cases, especially in log and data analytics, when there is a lot of data and not many search queries.
You don’t want to wait 30 seconds instead of two for a response, do you? So, on to the nuances: Elasticsearch is considered a standard for log management, but, for example, it can’t effectively parallelize a query to a single index shard. And Elasticsearch has only 1 shard by default, while modern servers have many more CPU cores. Making too many shards is also bad. All this doesn’t make life any easier for a devops who cares about response time: you have to think about what hardware Elasticsearch will run on and make changes accordingly.

Manticore, on the contrary, is able to parallelize the search query to all CPU cores unconditionally and by default. It would be more correct to say that Manticore itself decides when to parallelize and when not, but in most cases it does, which allows you to efficiently load the CPU cores (which are often idle in cases of logging and data analytics) and significantly reduce response time.

But even if you make as many shards in Elasticsearch as there are CPU cores on the server, Manticore turns out to be significantly faster. Specifically, here’s a test on 1.7 billion documents, from which you can see that overall Manticore is 4 times faster than Elasticsearch. If you are interested in the details or want to reproduce that on your own hardware, here is an article: https://db-benchmarks.com/test-taxi/ (all examples below are also supported by scripts and links, so you won’t find any idle talk in this blog).

Here is a different case: no big data, just 1.1 million comments from Hacker News. In this test, Manticore is 15x faster than Elasticsearch. The details are in the linked test.

And another test, indicative for Elasticsearch as a standard log analytics tool: 10 million Nginx logs and various quite realistic analytical queries. Here Manticore is 22 times faster than Elasticsearch. The details are in the linked test.

Data ingestion performance

There are also nuances with Elasticsearch’s write speed.
For example, the dataset for the 1.7 billion-document test discussed above was loaded:

- to Elasticsearch: in 28 hours and 33 minutes
- to Manticore Search: in 1 hour and 8 minutes

This was on a 32-core server with SSD. The amounts of data after indexing are about the same. The linked write-up explains exactly how the load was handled. In brief:

- Source: csv
- Logstash was used to put data to Elasticsearch with PIPELINE_BATCH_SIZE=10000 and PIPELINE_WORKERS=32 in 32 shards.
- Manticore Search used the built-in tool indexer to put data to 32 shards in parallel.

Here is the log of the data loading to Elasticsearch and Manticore: https://gist.github.com/sanikolaev/678dd862a7668921e3417321be0a2513

It turns out that in this test Manticore is 25 times faster in terms of data ingestion. Maybe I just don’t know how to cook Logstash and Elasticsearch properly, but for Mark Litwintschik the import of the same dataset (though of a slightly smaller size) took even longer: 4 days and 16 hours.

Maybe the problem is in Logstash, not Elasticsearch? Let’s find out by writing directly to Elasticsearch. The index scheme is as follows:

    "properties": {
      "name": {"type": "text"},
      "email": {"type": "keyword"},
      "description": {"type": "text"},
      "age": {"type": "integer"},
      "active": {"type": "integer"}
    }

Starting Manticore and Elasticsearch using their official Docker images like this:

    docker run --name manticore --rm -p 9308:9308 -v $(pwd)/manticore_idx:/var/lib/manticore manticoresearch/manticore:5.0.2

    docker run --name elasticsearch --rm -p 9200:9200 -e discovery.type=single-node -e xpack.security.enabled=false -v $(pwd)/es_idx/:/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:8.3.2

Let’s now put 50 million random docs like this to both:

    {
      1,
      84,
      "Aut corporis qui necessitatibus architecto est. Harum laboriosam temporibus praesentium quis et nulla. Consequuntur quia neque et repellat.",
      "terrill52@herzog.com",
      "Keely Doyle Sr."
    }

We’ll use simple php scripts with a batch size of 10,000 and concurrency 32 (there are 16 physical CPU cores on the server, with hyper-threading).

    root@perf3 ~ # php load_elasticsearch.php 10000 32 1000000 50
    preparing...
    found in cache
    querying...
    finished inserting
    Total time: 178.24096798897
    280519 docs per sec

    root@perf3 ~ # php load_manticore.php 10000 32 1000000 50
    preparing...
    found in cache
    querying...
    finished inserting
    Total time: 215.7572619915
    231742 docs per sec

OK, now Elastic is 21% faster, but again there is an interesting nuance: Elasticsearch by default buffers new documents for one second, which means the last batch will not be available for searching right away. This is ok in many cases, but to make things fair let’s do /bulk?refresh=1 in Elasticsearch and see what it gives:

    root@perf3 ~ # php load_elasticsearch.php 10000 32 1000000 50
    preparing...
    found in cache
    querying...
    finished inserting
    Total time: 307.47588610649
    162614 docs per sec

In this case Manticore is the faster one, by 43%.

If we want to test the maximum performance, we can:

- Use sharding in both Elasticsearch and Manticore
- Let Elasticsearch buffer incoming documents as much as possible
- Use the MySQL interface to put data to Manticore Search (it’s slightly faster)
- Disable binlog in Manticore Search (unfortunately, you can’t do that in Elasticsearch)

Here’s what it gives:

Manticore:

    // docker run -p9306:9306 --name manticore --rm -v $(pwd)/manticore_idx:/var/lib/manticore -e searchd_binlog_path= manticoresearch/manticore:5.0.2
    root@perf3 ~ # php load_manticore_sharded.php 10000 32 1000000 32 50
    preparing...
    found in cache /tmp/bc9719fb0d26e18fc53d6d5aaaf847b4_10000_1000000
    querying...
    finished inserting
    Total time: 55.874907970428
    894856 docs per sec

Elasticsearch:

    root@perf3 ~ # php load_elasticsearch_sharded.php 10000 32 1000000 32 50
    preparing...
    found in cache
    querying...
    finished inserting
    Total time: 119.96515393257
    416788 docs per sec

But remember the nuance: you have to spend another 13 seconds to make the documents searchable:

    root@perf3 ~ # curl -s -X POST "localhost:9200/_sql?format=json&pretty" -H 'Content-Type: application/json' -d'{"query": "select count(*) from user"}'
    {
      "columns" : [ { "name" : "count(*)", "type" : "long" } ],
      "rows" : [ [ 0 ] ]
    }

    root@perf3 ~ # time curl -XPOST "localhost:9200/user/_refresh"
    {"_shards":{"total":64,"successful":32,"failed":0}}
    real 0m13.505s
    user 0m0.003s
    sys 0m0.000s

    root@perf3 ~ # curl -s -X POST "localhost:9200/_sql?format=json&pretty" -H 'Content-Type: application/json' -d'{"query": "select count(*) from user"}'
    {
      "columns" : [ { "name" : "count(*)", "type" : "long" } ],
      "rows" : [ [ 50000000 ] ]
    }

All in all, Manticore is 2x faster than Elasticsearch in terms of data ingestion performance. And the data is searchable immediately after the batch is loaded, not 2 minutes later. The scripts used for this test are linked from the original post.

What it’s written in

Elasticsearch itself is written in Java, and the Lucene library it uses and depends on is also written in Java. Manticore is written in C++. What it gives:

- The code is harder to write, yes. But we are closer to the hardware, so we can make more optimized code.
- No need to think about JVM heap size.
- There is no risk of the JVM garbage collector starting gc at an inappropriate moment, which can greatly affect performance.
- No need to run a heavy JVM on startup, which takes quite some time.

Open source

Elasticsearch is not pure open source anymore. In 2021 the license was changed from Apache 2 to dual licensing under the Elastic License and SSPL.

Manticore is purely open source, with the GPLv2 license for the daemon and the Apache 2 license for the columnar library.

JSON vs SQL

Both Elasticsearch and Manticore can do both SQL and JSON, but the difference is: Elasticsearch is based on JSON by default while Manticore is SQL-first. What we love about SQL is that if you use it, many things are much easier to do at the proof of concept stage.
For example, here are 2 queries that do the same thing. Do you wanna spend a minute counting { and } brackets or … ?

SQL is very limited in Elasticsearch, for example:

- you can’t do SELECT id
- you can’t INSERT/UPDATE/DELETE
- you can’t run service commands (create cluster, see status, etc.)

In Manticore it’s the other way around:

- You can do everything via SQL
- JSON covers only basic functionality: search and data modification queries

Startup time

In some cases, you need to be able to launch a service quickly, for example in IoT (Internet of Things) or some ETL scenarios. Elasticsearch takes a long time to start up. Manticore takes just a couple of seconds to start up with a table of 1.1 million documents.

Near-real-time vs real-time

As mentioned above, by default when you put data to Elasticsearch, it becomes searchable only after a second. This can be adjusted, but then the ingestion rate becomes significantly slower, as you can see above.

Manticore always works in real-time mode.

Full-text search

Probably worth another article to explain it all. In short: both Manticore and Elasticsearch are good in terms of full-text search and have a lot in common, but there are a lot of differences, too. According to these objective tests (objectivity is important when evaluating relevance), on almost default settings Manticore can give higher relevance than Elasticsearch. Here is the relevant pull request in BEIR (an information retrieval benchmark).

Aggregations

Both Manticore and Elasticsearch provide rich aggregation functionality.
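To make the term concrete before the query examples, here is a minimal sketch of what a bucketed aggregation with a bucket filter computes. It is plain Python, not Manticore or Elasticsearch code; the rows and the `average_by_year` helper are made up for illustration, and it only mirrors the idea behind a `GROUP BY ... HAVING` query conceptually.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (release_year, rental_rate).
FILMS = [
    (2000, 2.99), (2000, 4.99),
    (2001, 0.99), (2001, 2.99),
    (2002, 4.99), (2002, 4.99),
]


def average_by_year(rows, min_avg=None):
    """Group rows into per-year buckets and average the rate in each bucket.

    If min_avg is given, keep only buckets whose average exceeds it,
    which is the role a HAVING clause plays in SQL.
    """
    buckets = defaultdict(list)
    for year, rate in rows:
        buckets[year].append(rate)

    averages = {year: mean(rates) for year, rates in buckets.items()}
    if min_avg is not None:
        averages = {y: a for y, a in averages.items() if a > min_avg}

    # Return buckets ordered by key, like ORDER BY release_year ASC.
    return dict(sorted(averages.items()))


if __name__ == "__main__":
    for year, avg in average_by_year(FILMS, min_avg=3).items():
        print(year, round(avg, 2))
```

A search engine does the same bucketing over its stored attributes, so the interesting differences between engines lie in what can be used as a bucket key and how the buckets can be sorted, filtered, and paginated.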
You probably know what Elasticsearch can do; here’s what can be done in Manticore, for you to compare:

- Just grouping:
    SELECT release_year FROM films GROUP BY release_year LIMIT 5
- Get aggregates:
    SELECT release_year, AVG(rental_rate) FROM films GROUP BY release_year LIMIT 5
- Sort buckets:
    SELECT release_year, count(*) from films GROUP BY release_year ORDER BY release_year asc limit 5
- Group by multiple fields at the same time:
    SELECT category_id, release_year, count(*) FROM films GROUP BY category_id, release_year ORDER BY category_id ASC, release_year ASC
- Get N records from each bucket, not 1:
    SELECT release_year, title FROM films GROUP 2 BY release_year ORDER BY release_year DESC LIMIT 6
- Sort inside a bucket:
    SELECT release_year, title, rental_rate FROM films GROUP BY release_year WITHIN GROUP ORDER BY rental_rate DESC ORDER BY release_year DESC LIMIT 5
- Filter buckets:
    SELECT release_year, avg(rental_rate) avg FROM films GROUP BY release_year HAVING avg > 3
- Use GROUPBY() to access the aggregation key:
    SELECT release_year, count(*) FROM films GROUP BY release_year HAVING GROUPBY() IN (2000, 2002)
- Group by array value:
    SELECT groupby() gb, count(*) FROM shoes GROUP BY sizes ORDER BY gb asc
- Group by json node:
    SELECT groupby() color, count(*) from products GROUP BY meta.color
- Get count of distinct values:
    SELECT major, count(*), count(distinct age) FROM students GROUP BY major
- Use GROUP_CONCAT():
    SELECT major, count(*), count(distinct age), group_concat(age) FROM students GROUP BY major
- Use FACET after your main query and it will group the main query’s results:
    SELECT *, price AS aprice FROM facetdemo LIMIT 10 FACET price LIMIT 10 FACET brand_id LIMIT 5
- Faceting by aggregation over another attribute:
    SELECT * FROM facetdemo FACET brand_name by brand_id
- Faceting without duplicates:
    SELECT brand_name, property FROM facetdemo FACET brand_name distinct property
- Facet over expressions:
    SELECT * FROM facetdemo FACET INTERVAL(price,200,400,600,800) AS price_range
- Facet over multi-level grouping:
    SELECT *,INTERVAL(price,200,400,600,800) AS price_range FROM facetdemo FACET price_range AS price_range, brand_name ORDER BY brand_name asc;
- Sorting of facet results:
    SELECT * FROM facetdemo FACET brand_name BY brand_id ORDER BY FACET() ASC FACET brand_name BY brand_id ORDER BY brand_name ASC FACET brand_name BY brand_id ORDER BY COUNT(*) DESC
- Pagination in facet results:
    SELECT * FROM facetdemo FACET brand_name BY brand_id ORDER BY FACET() ASC LIMIT 0,1 FACET brand_name BY brand_id ORDER BY brand_name ASC LIMIT 2,4 FACET brand_name BY brand_id ORDER BY COUNT(*) DESC LIMIT 4;

Schemaless

Elasticsearch is famous for the fact that you can write anything into it. With Manticore Search, you have to create a scheme beforehand. Many Elasticsearch experts recommend using static mapping, for example, https://octoperf.com/blog/2018/09/21/optimizing-elasticsearch/#index-mapping:

    One of the very first things you can do is to define your indice mapping statically.

But we find dynamic mapping important in the area of log management and analysis. Since we want Manticore to be easy to use for that, we have plans to enable dynamic mapping in Manticore, too.

Integrations

Both Elasticsearch and Manticore have clients for different programming languages.

MySQL wire protocol support: An important advantage of Manticore is the possibility to use MySQL clients to work with the server. Even if there is no official Manticore client for some language, there is definitely a MySQL client you can use. Using the command line MySQL client for administration is more convenient than using curl, because the commands are much more compact and the session is supported.

The support for the MySQL protocol has also made it possible to support the MySQL/MariaDB FEDERATED engine for tight integration between those and Manticore.

In addition, Manticore can be used via ProxySQL.

HTTP: JSON API is supported in both Elasticsearch and Manticore.

Logstash, Kibana: Manticore supports Kibana, but it’s a work in progress and in a beta stage. We’ll get those integrations up to speed soon. This is how you can try Manticore with Kibana:

    # download manticore beta version with support for Kibana, check https://repo.manticoresearch.com/repository/kibana_beta/ for different OS versions
    wget https://repo.manticoresearch.com/repository/kibana_beta/ubuntu/jammy.zip

    # unarchive it
    unzip jammy.zip

    # install the packages
    dpkg -i build/*

    # switch Manticore to the mode supporting Kibana
    mysql -P9306 -h0 -e "set global log_management = 0; set global log_management = 1;"

    # start Kibana pointing it to Manticore Search instance listening on port 9308
    docker run -d --name kibana --rm -e ELASTICSEARCH_HOSTS=http://127.0.0.1:9308 -p 5601:5601 --network=host docker.elastic.co/kibana/kibana:7.4.2

    # install php and composer, download loading script and put into Manticore 1 million docs of fake users
    apt install php composer php8.1-mysql
    wget https://gist.githubusercontent.com/sanikolaev/13bf61bbe6c39350bded7c577216435f/raw/8d8029c0d99998c901973fd9ac66a6fb920deda7/load_manticore_sharded.php
    composer require fakerphp/faker
    php load_manticore_sharded.php 10000 16 1000000 16 1

    # don't forget to create an index pattern in Kibana (user*)
    # run `docker stop kibana` to stop the Kibana server

If all went well, you should see the loaded documents in Kibana.

Replication

Both Elasticsearch and Manticore Search use synchronous replication. At Manticore we decided not to reinvent the wheel and integrated with the Galera library, which is also used by MariaDB and Percona XtraDB Cluster.

An important difference in managing replication and clustering is that with Elasticsearch you need to edit the config to set up a replica, while in Manticore you don’t: replication is always enabled and it’s very easy to connect to and sync up with another node.

Sharding and distributed indexes

Unlike Elasticsearch, Manticore does not yet have automatic sharding, but combining multiple indexes into one for manual sharding is easier than in Elasticsearch. Adding an index located on a remote node is also supported: just specify the remote host, port, and index name.

Ease of use and learning

Our thinking is that we don’t want our users, be it a developer or a devops, to have to become experts in databases or search engines, or to have a PhD, to be able to use Manticore products. We assume you have other things to do rather than spending hours trying to understand how this or that setting affects this or that functionality. Hence, Manticore Search should work fine in most cases even on defaults.

Our ultimate goal is to make Manticore Search as easy to use and learn as possible.

- As mentioned previously, Manticore is SQL-first, which we find important while you are just getting started with Manticore compared to Elasticsearch.
- Manticore provides interactive courses at play.manticoresearch.com to walk you through the essential steps to get familiar with Manticore.
- There is a guide on how to get started, with examples for different OSes and programming languages: https://manual.manticoresearch.com/Quick_start_guide
- You can talk directly to the developers in public channels: Slack, Telegram, Forum.
- We have a special short domain mnt.cr integrated with the documentation, so that mnt.cr/ plus a keyword takes you to the search results in the documentation in a special mode: it immediately rewinds to the most relevant section. This is especially handy when you need to recall some details on some setting, e.g. mnt.cr/max_packet_size.

Cloud native

Elasticsearch provides a Kubernetes operator.

Manticore Search provides a Helm chart.

Imperative and declarative usage modes

In Elasticsearch, most things are only done through the API. There is no way (anymore) to add mappings to a configuration file so that they are available immediately after startup.

Manticore, like Kubernetes, supports two usage modes:

- Imperative: when everything can be managed online using CREATE TABLE/DROP TABLE/ALTER TABLE, CREATE CLUSTER/JOIN CLUSTER/DELETE CLUSTER etc.
- Declarative: when you can define mappings in a configuration file, which gives greater portability and easier integration of Manticore into CI/CD, ETL, and other processes.

Percolate

Percolate, or Persistent Query, is when a table contains queries, not documents, and the search is performed with documents as input, not queries. The search results are the stored queries that match the given documents. This type of search is useful for users’ subscriptions: if you subscribed, for example, to the query TV > 42 inches, then as soon as such a TV appears on the site, you will be notified about it. Manticore provides this functionality, as does Elasticsearch. According to the tests we did a few years ago, throughput of this type of search in Manticore is significantly higher than in Elasticsearch.

What’s next?

We are now developing the project in the following directions:

- Drop-in replacement for Elasticsearch in the ELK stack, so Kibana and Logstash (or the OpenSearch alternatives) can work with it fine. We want the low latency that’s easier to achieve with Manticore to be available to people for log analysis. We already have a beta.
- Schemaless mode. When you use Manticore as a log analysis solution you don’t have to think about the schemas.
- Automatic sharding and orchestration of shards, so you can load data into Manticore even faster and the shards will be spread out in an optimal order for better fault tolerance.
- Further performance optimizations. We just want even lower latency and higher throughput, so you can run Manticore on cheaper hardware and make the Earth greener.

Conclusions

So, at the end of it all, what do we have? Manticore may now be of interest to those:

- Who care about low response times on both small and large amounts of data
- Who like SQL
- Who want something simpler than Elasticsearch to integrate search into their application faster
- Who want something more lightweight, which starts fast
- Who care about using purely open source software

We are continuing!
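For readers who want to re-check the ingestion ratios quoted in the data ingestion section, here is a small sketch that recomputes them from the raw timings and docs-per-second figures printed in this article. The function and variable names are made up for this example; the numbers themselves are copied from the logs above, and the last figure (counting the extra refresh time against Elasticsearch) is my own derived number, not one stated in the text.

```python
def seconds(hours=0, minutes=0):
    """Convert an hours/minutes duration to seconds."""
    return hours * 3600 + minutes * 60


def ratio(a, b):
    """How many times larger a is than b."""
    return a / b


# 1.7 billion-document dataset: 28h33m (Elasticsearch) vs 1h08m (Manticore).
bulk_load_speedup = ratio(seconds(28, 33), seconds(1, 8))

# 50 million docs, single table, docs per second as printed by the scripts.
ES_DEFAULT = 280519         # Elasticsearch, default refresh behaviour
ES_REFRESH = 162614         # Elasticsearch, /bulk?refresh=1
MANTICORE_DEFAULT = 231742  # Manticore, HTTP JSON interface

# 50 million docs, 32 shards, tuned for maximum throughput.
MANTICORE_SHARDED = 894856
ES_SHARDED = 416788

RESULTS = {
    "bulk_load_speedup": bulk_load_speedup,
    "es_lead_default_pct": (ratio(ES_DEFAULT, MANTICORE_DEFAULT) - 1) * 100,
    "manticore_lead_refresh_pct": (ratio(MANTICORE_DEFAULT, ES_REFRESH) - 1) * 100,
    "manticore_sharded_speedup": ratio(MANTICORE_SHARDED, ES_SHARDED),
    # Derived: total wall time if the 13.505 s refresh is added to the
    # Elasticsearch sharded run (119.965 s) and compared with Manticore (55.875 s).
    "sharded_speedup_incl_refresh": ratio(119.96515393257 + 13.505, 55.874907970428),
}

if __name__ == "__main__":
    for name, value in RESULTS.items():
        print(f"{name}: {value:.2f}")
```

Running it gives roughly 25x for the bulk load, a 21% lead for Elasticsearch with default buffering, a 43% lead for Manticore once refresh is forced, and a bit over 2x in the sharded run, which matches the rounded figures used in the text.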