From a practical point of view, the “product” we end up with will barely be usable in production for mass web crawling. But if you just need to crawl your own site, a competitor’s site or anyone else’s, and want to search it with an advanced query syntax rather than just grep, this article should be useful for you.
The article should also be useful for those who are just starting with docker-compose or Manticore Search.
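Our solution will be based on:
- wget, which does the crawling
- a small PHP script (load.php), which loads the downloaded pages into Manticore Search
- Manticore Search, which stores the pages and makes them searchable
- Docker Compose, which ties all of the above together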
Once we are done you should be able to run it via docker-compose like this:
domain=who.int docker-compose up
which will start crawling and indexing https://who.int and will immediately run another container with a web server, so you can search in the crawled pages.
So what technologies will we use in our solution?
Wget
Everyone probably knows wget. When you need to download something in a terminal on Linux, FreeBSD or macOS, you will most likely use wget. But did you know that wget can not only download a single file, but can also be used as a simple web crawler that respects robots.txt, follows links and doesn’t overload your system? Well, if not, you know now. Yes, it doesn’t come with load distribution among a network of crawling servers, or even the ability to download pages in parallel. It’s actually not scalable at all, but it’s simple, and it’s a tried and trusted tool that suits our idea very well, since the whole job can be done in just one call of wget:
wget -nv -r -H -nd --connect-timeout=2 --read-timeout=10 --tries=1 --follow-tags=a -R "*.css*,*.js*,*.png,*.jpg,*.gif" "http://${domain}/" --domains=${domain} 2>&1 | php load.php
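Let’s go through the most important parameters:
- -nv: non-verbose mode, wget prints only one short line per downloaded file (and errors)
- -r: recursive retrieval, i.e. wget follows the links it finds
- -H together with --domains=${domain}: wget may move to other hosts, but only those within the given domain, so www.who.int is crawled when you ask for who.int
- -nd: no directory hierarchy, all files are saved to the current directory, and files with the same name get a numeric suffix such as index.html.3
- --connect-timeout=2, --read-timeout=10, --tries=1: don’t wait long for slow pages and don’t retry failed downloads
- --follow-tags=a: follow only the links found in <a> tags
- -R "*.css*,*.js*,*.png,*.jpg,*.gif": skip stylesheets, scripts and images
- 2>&1: wget writes its log to stderr, so we redirect it to stdout to be able to pipe it to load.php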
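If the site you crawl is not your own, you may want to be gentler than the command above. The following is only a sketch: --wait, --random-wait, --limit-rate and --level are standard wget options, but the values chosen here, the crawl function and its DRY_RUN switch are assumptions made for illustration and are not part of the original setup.

```shell
#!/bin/sh
# Hypothetical wrapper around the article's wget call with throttling added.
# Usage: crawl who.int             (runs wget)
#        DRY_RUN=1 crawl who.int   (only prints the wget arguments, one per line)
crawl() {
  domain=$1
  # "set --" builds the argument list safely, so the -R patterns are never glob-expanded
  set -- -nv -r -H -nd \
    --connect-timeout=2 --read-timeout=10 --tries=1 \
    --follow-tags=a \
    -R '*.css*,*.js*,*.png,*.jpg,*.gif' \
    --wait=1 --random-wait \
    --limit-rate=500k \
    --level=3 \
    "--domains=$domain" "http://$domain/"
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$@"
  else
    wget "$@" 2>&1   # wget logs to stderr; merge it into stdout so it can be piped
  fi
}
```

With this in place the crawl could be started as crawl who.int | php load.php. The --wait and --random-wait options pause between requests, --limit-rate caps the download speed, and --level limits how deep wget follows links (its default depth is 5).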
load.php
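This is a simple and straightforward script of 15 lines of code which:
- reads the wget output from STDIN line by line
- creates a table in Manticore Search if it doesn’t exist yet
- extracts the URL and the local file name from each line wget prints about a downloaded page
- reads the downloaded file and takes its <title>
- puts the title, the URL and the page contents into Manticore Search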
Here is the full script with each line commented:
<?php
$f = fopen('php://stdin', 'r'); # we'll be waiting for data at STDIN
$manticore = new mysqli('manticore', '', '', '', 9306); # let's connect to Manticore Search via MySQL protocol
$manticore->query("CREATE TABLE IF NOT EXISTS rt(title text, body text, url text stored) html_strip='1' html_remove_elements='style,script,a' morphology='stem_en' index_sp='1'"); /* creating a table "rt" if it doesn't exist with the following settings:
- html_strip='1': stripping HTML is on
- html_remove_elements='style,script,a': for HTML tags <style>/<script>/<a> we don't need their contents, so we are stripping them completely
- morphology='stem_en': we'll use English stemmer as a morphology processor
- index_sp='1': we'll also index sentences and paragraphs for more advanced full-text search capabilities and better relevance
*/
while (!feof($f)) { # reading from STDIN while there's something
$s = fgets($f); /* getting one line. Here is an example of wget returns:
2020-04-08 07:39:33 URL:https://www.who.int/westernpacific/ [98667/98667] -> "index.html.3" [1]
which means that:
- the original URL was https://www.who.int/westernpacific/
- that it saved the contents to index.html.3
*/
if (!preg_match('/URL:(?<url>http.*?) \[.*?\] -> "(?<path>.*?)"/', $s, $match)) continue; # if wget returns smth else we are just skipping it, otherwise we use regexp to put the url and the path to $match
$tries = 0; do { # it may be that wget returns the info about a download earlier than the file appears, so we are looping until we can read from the file, but we give up after 100 attempts (about a second) so that an empty file can't stall the crawl:
$content = @file_get_contents('./'.$match['path']); # reading from the file
usleep(10000); # sleeping 10 milliseconds
} while (!$content && ++$tries < 100); # end the loop when we have the content or have run out of attempts
if (preg_match('/<title>(?<title>.*?)<\/title>/is', $content, $content_match)) $title = trim(html_entity_decode($content_match['title'])); # here we are doing a simple HTML page parsing to get <title> from that
else continue; # we are not interested in pages without a title
echo "{$match['path']}: $title {$match['url']} ".strlen($content)." bytes\n"; # let's say something about our progress
$manticore->query("REPLACE INTO rt (id,title,url,body) VALUES(".crc32($title).",'".$manticore->escape_string($title)."','".$manticore->escape_string($match['url'])."','".$manticore->escape_string($content)."')"); # and we are finally putting the contents to Manticore. We use crc32(title) as a document ID to avoid duplicates.
} # and we are going back to the next page wget reports as downloaded
So as soon as wget downloads at least something it will appear in Manticore immediately and will be searchable. Your data collection will grow until wget can’t download anything else or until you stop the container.
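To see what load.php extracts from each line of wget output, the same parsing can be tried in the shell. This is a rough equivalent written for illustration only: parse_wget_line is a hypothetical helper, and the sed pattern uses negated character classes instead of the lazy quantifiers of the PHP regular expression.

```shell
#!/bin/sh
# Prints "<url> <path>" for a "wget -nv" download line and nothing for any other line.
parse_wget_line() {
  printf '%s\n' "$1" |
    sed -n -E 's/.*URL:(http[^ ]*) \[[^]]*\] -> "([^"]*)".*/\1 \2/p'
}

# The sample line from the comment in load.php:
parse_wget_line '2020-04-08 07:39:33 URL:https://www.who.int/westernpacific/ [98667/98667] -> "index.html.3" [1]'
# prints: https://www.who.int/westernpacific/ index.html.3
```

Lines that don’t describe a download, such as wget’s final summary, produce no output, which corresponds to the continue in the PHP script.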
Manticore Search
Another important component is Manticore Search.
Manticore is a lightweight database written in C++, created specifically for search, with powerful full-text search capabilities.
It can speak SQL over MySQL protocol as well as JSON over HTTP. What’s important for our purpose is that:
So all we need to do to hook up Manticore in our case is these 3 lines in docker-compose.yml:
services:
  manticore:
    image: manticoresearch/manticore:3.4.0
Docker compose file
Docker Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.
Our docker-compose YAML looks like this:
version: '2.2'
services:
  # Manticore Search is a small yet powerful database for search with awesome full-text search capabilities
  manticore:
    # we'll just use their official image
    image: manticoresearch/manticore:3.4.0
    # and create a volume for data persistency
    volumes:
      - ./data:/var/lib/manticore
  # we also need php
  php:
    # which we'll build ourselves from Dockerfile
    build: php
    # no point to run the php container before manticore, hence the dependency
    depends_on:
      - manticore
    # the command below just runs wget to start crawling the domain passed in the env. variable
    # and lets the wget output flow to "php load.php" which inserts it into Manticore Search
    command: /bin/bash -c 'wget -nv -r -H -nd --connect-timeout=2 --read-timeout=10 --tries=1 --follow-tags=a -R "*.css*,*.js*,*.png,*.jpg,*.gif" "http://${domain}/" --domains=${domain} 2>&1 | php load.php'
  # let's also add a tiny php script to visualize what we have in Manticore
  web:
    # we'll use php 7.2 + Apache for that
    image: php:7.2-apache
    # it also depends on Manticore
    depends_on:
      - manticore
    # let's bind it to 8082 port locally
    ports:
      - 8082:80
    # we'll mirror folder "www" to /var/www/html/ inside the web server container so ./www/index.php will be the front page
    volumes:
      - ./www/:/var/www/html/
There is also a Dockerfile for the php service, which adds wget and the mysqli extension to the PHP image:
# Let's take php 7.4 as a base image
FROM php:7.4-cli
# We'll also install wget and PHP mysqli extension
RUN apt-get update \
&& apt-get -y install wget \
&& docker-php-source extract \
&& docker-php-ext-install mysqli \
&& docker-php-source delete
# We'll use load.php, so we need to copy it to the image
COPY load.php /usr/src/myapp/
# And let's change the working dir
WORKDIR /usr/src/myapp
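Please go through the comments in them. In a nutshell, the compose file includes 3 services:
- manticore: Manticore Search itself, running from the official image and keeping its data in ./data
- php: the container that runs wget and pipes its output to load.php
- web: PHP with Apache, serving the search page from ./www on port 8082
Feel free to change the port from 8082 to whatever you want. We also use the environment variable $domain to specify the domain to crawl. So when you run it like this: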
domain=who.int docker-compose up
it runs the above 3 services and starts crawling:
snikolaev@dev:~/crawler$ domain=who.int docker-compose up
Starting crawler_manticore_1 … done
Recreating crawler_web_1 … done
Starting crawler_php_1 … done
...
php_1 | data.5: GHO https://www.who.int/data/gho 125537 bytes
php_1 | fact-sheets.4: Fact sheets https://www.who.int/news-room/fact-sheets 83345 bytes
php_1 | facts-in-pictures.3: Facts in pictures https://www.who.int/news-room/facts-in-pictures 70227 bytes
php_1 | publications.7: WHO | Publications https://www.who.int/publications/en/ 92069 bytes
php_1 | questions-answers.3: WHO | Online Q&A https://www.who.int/features/qa/en/ 78145 bytes
php_1 | popular.3: Health topics https://www.who.int/health-topics/ 123263 bytes
php_1 | ebola-virus-disease.8: Ebola virus disease https://www.who.int/health-topics/ebola/ 112116 bytes
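The compose file builds the start URL as http://${domain}/ and passes the same value to --domains, so the variable has to hold a bare host name. A small wrapper can guard against passing a full URL or nothing at all. This is only a sketch: normalize_domain and start_crawl are hypothetical names and are not part of the demo.

```shell
#!/bin/sh
# Strip the scheme and anything after the host name, so "https://who.int/" becomes "who.int".
normalize_domain() {
  d=${1#http://}
  d=${d#https://}
  d=${d%%/*}
  printf '%s\n' "$d"
}

# Start the crawl only when a usable domain was given.
start_crawl() {
  host=$(normalize_domain "${1:-}")
  if [ -z "$host" ]; then
    echo "usage: start_crawl <domain>" >&2
    return 1
  fi
  domain=$host docker-compose up
}
```

For example, start_crawl https://who.int/ would run the same command as domain=who.int docker-compose up, while start_crawl without an argument stops with a usage message instead of letting Compose substitute an empty string.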
Search bar
The last component we haven’t covered yet is index.php which runs when you open http://hostname:8082 (or another port if you changed it in the compose file). The full script is just 13 lines of code:
<form><h1>Manticore</h1><input name="search" type="text" style="width: 50%; border: 1px solid" value="<?=htmlspecialchars($_GET['search'] ?? '')?>"></form>
<hr>
<?php
if (isset($_GET['search'])) { # we have a search request, let's process it
$ch = curl_init(); # initializing curl
curl_setopt($ch, CURLOPT_URL,"http://manticore:9308/sql"); # we'll connect to Manticore's /sql endpoint via HTTP. There's also /json/search/ which gives much more granular control, but for the sake of simplicity we'll use the /sql endpoint
curl_setopt($ch, CURLOPT_POST, 1); # we'll send via POST
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); # we need the response back, don't output it
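curl_setopt($ch, CURLOPT_POSTFIELDS, "mode=raw&query=".urlencode("SELECT url, highlight({}, 'title') title, highlight({}, 'body') body FROM rt WHERE MATCH('".str_replace(['\\', "'"], ['\\\\', "\\'"], $_GET['search'])."') LIMIT 10")); /* the user input is escaped for SQL and the whole query is URL-encoded, so characters like ' and & can't break the request. Here we are SELECTing: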
- url
- highlighted title
- highlighted body
- from the index called "rt"
- we want all documents that MATCH() our search query
- and we need only the first 10, hence LIMIT 10
*/
if ($json = json_decode(curl_exec($ch))) { # running the query and decoding the JSON
foreach ($json->data as $result) echo "<small>{$result->url}</small><br><a href=\"{$result->url}\">{$result->title}</a><br>{$result->body}<br><br>"; # and here we just output the results: url, title and body
}
}
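Here, unlike in load.php, we connect to Manticore over HTTP and use its JSON API endpoint /sql, which lets you send any SQL command over HTTP. In a production environment it might make more sense to use Manticore’s /json/search endpoint, which lets you break the request down into much more granular pieces. That is often important if your search form is not just one text field but has multiple fields, and in other cases. But we don’t need all that now. The logic of the script is simple:
- if the request has a search parameter, we send a SELECT ... WHERE MATCH() query to Manticore’s /sql endpoint via curl
- we decode the JSON response
- for each of up to 10 results we print the URL, the highlighted title and the highlighted body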
That’s it. Nothing complicated.
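The same request can be sent from a terminal with curl, which is handy for checking the index without the web page. This is a sketch under assumptions: it presumes that Manticore’s HTTP port 9308 is reachable from where you run it (the compose file above does not publish it, so you would have to add a port mapping or run the command from another container), and escape_term, build_sql, search and MANTICORE_URL are names made up for this example.

```shell
#!/bin/sh
# Escape backslashes and single quotes so the term can sit inside MATCH('...').
escape_term() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e "s/'/\\\\'/g"
}

# Build the same SELECT that index.php sends.
build_sql() {
  printf "SELECT url, highlight({}, 'title') title, highlight({}, 'body') body FROM rt WHERE MATCH('%s') LIMIT 10" "$(escape_term "$1")"
}

# POST the query to the /sql endpoint; --data-urlencode takes care of URL encoding.
# MANTICORE_URL is an assumed variable, defaulting to a locally published port.
search() {
  curl -s "${MANTICORE_URL:-http://localhost:9308}/sql" \
    -d 'mode=raw' \
    --data-urlencode "query=$(build_sql "$1")"
}
```

Calling search covid would then print the raw JSON response that index.php decodes and renders.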
What can it do?
Let’s now see what we can do with what we’ve built. Why didn’t we just dump wget output to files and use grep to search in them? Here is why:
First, our search engine not only finds what matches your query, it also highlights the matches and sorts the results properly using an improved ranking formula similar to BM25. For example, the results containing “IPC precaution recommendations” go first since they have the whole phrase.
Second, you can use Manticore’s extended query syntax to do many interesting things. For example, you might want to find only those documents that have “covid” and “caught” in the same sentence or paragraph. The queries covid SENTENCE caught and covid PARAGRAPH caught do exactly that, which works because we enabled index_sp when creating the table.
Or you can match by a whole phrase, use OR and NOT and many more.
Third, do you remember that when we were doing CREATE TABLE we turned on English stemming? Here is how we can use it now: if I enter “coronaviruses” it finds “coronavirus” too.
So even though the crawling part is very basic, the search part of our solution is quite powerful. You definitely can’t do anything like this with wget and grep alone.
How do I run it myself?
git clone https://github.com/manticoresoftware/demos.git manticore_demos
cd manticore_demos/crawler/
domain=who.int docker-compose up
If you run it for the first time, you’ll have to wait a few minutes for Docker to download the images and build the php service. Afterwards it will start crawling http://who.int (or another domain you specify), and the search UI will be available at http://localhost:8082, unless you run it on a remote server.
Thanks for reading! The code is available at https://github.com/manticoresoftware/demos in the crawler directory.