Build a Web Crawler with a Search Bar Using Wget and Manticore [A Step-by-Step Guide]

Hi everyone. In this article we are going to build a simple web crawler and a small search application using well-known existing technologies which you perhaps didn’t know could do that.

From a practical point of view, the “product” we will have in the end will hardly be suitable for mass web crawling in production. But if you just need to crawl your own site, a competitor’s site or someone else’s, and want to use an advanced search syntax and not just grep, this article should be useful for you.

The article should also be useful for those who are just starting with docker-compose or Manticore Search.

TL;DR

Our solution will be based on:

wget crawling a site recursively
tiny php script to pass the crawled content from wget to Manticore Search
tiny php script to make a minimalistic search UI
all wrapped in docker and docker-compose

Once we are done you should be able to run it via docker-compose like this:

domain=who.int docker-compose up

which will start crawling and indexing https://who.int and will immediately run another container with a web server, so you can search in the crawled pages.

Technologies

So what technologies will we use in our solution?

Wget

Everyone probably knows wget. When you need to download something in a terminal on Linux, FreeBSD or macOS, you will most likely use wget. But did you know that wget can not only download a single file, but can also serve as a simple web crawler which respects robots.txt, can follow links and doesn’t overload your system? Well, if not, you know now. Yes, it doesn’t come with load distribution across a network of crawling servers or even the ability to download in parallel. It’s actually not scalable at all, but it’s a simple, tried and trusted tool which suits our idea very well, since the whole job can be done in just one call of wget:

wget -nv -r -H -nd --connect-timeout=2 --read-timeout=10 --tries=1 --follow-tags=a -R "*.css*,*.js*,*.png,*.jpg,*.gif" "http://${domain}/" --domains=${domain} 2>&1 | php load.php

Let’s go through the most important parameters:

-nv turns off verbose output, so wget prints one short line per downloaded file, which is what we will be parsing
-r turns on recursive retrieving. Obviously the most important part for us
-H enables spanning across hosts when doing recursive retrieving. Combined with --domains below, it lets wget follow links to subdomains such as www.who.int
-nd disables creating directories when retrieving recursively
--follow-tags=a limits the HTML tags to follow to just the hyperlink tag <a>
-R "*.css*,*.js*,*.png,*.jpg,*.gif" lists patterns to ignore. Obviously we don’t need any images or css/js files for full-text search, so we are ignoring them
"http://${domain}/" is our starting point. It will be the first page wget will download
--domains=${domain} lets us define the domains to be followed. In our case we are limiting the crawl to the domain we started from
2>&1 redirects wget’s log from stderr to stdout, since wget writes its messages to stderr and the pipe only carries stdout
| php load.php finally pipes wget’s output to load.php
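
To make the format of that output concrete, here is a minimal sketch that parses one line printed by wget -nv. The sample line and the regular expression are the ones that appear in load.php below; the function name parseWgetLine is a hypothetical helper introduced only for this illustration.

```php
<?php
// Hypothetical helper (not part of the original scripts): extracts the URL and
// the local file name from one line of `wget -nv` output.
function parseWgetLine(string $line): ?array
{
    // Same pattern as in load.php: URL:<url> [<size>] -> "<path>"
    if (!preg_match('/URL:(?<url>http.*?) \[.*?\] -> "(?<path>.*?)"/', $line, $m)) {
        return null; // not a "file saved" line, e.g. an error message
    }
    return ['url' => $m['url'], 'path' => $m['path']];
}

$sample = '2020-04-08 07:39:33 URL:https://www.who.int/westernpacific/ [98667/98667] -> "index.html.3" [1]';
$parsed = parseWgetLine($sample);
echo $parsed['url'], ' saved as ', $parsed['path'], "\n";
```

Lines that don’t describe a saved file simply return null, which is why the loader can skip them.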

load.php

This is a simple and straightforward script of just 15 lines of code which:

makes a connection to Manticore Search using a MySQL library
creates a new table if it doesn’t exist yet with the morphology settings we need
reads info about downloaded pages from wget at STDIN
reads each page and puts it to Manticore

Here is the full script with each line commented:

<?php
$f = fopen('php://stdin', 'r'); # we'll be waiting for data at STDIN
$manticore = new mysqli('manticore', '', '', '', 9306); # let's connect to Manticore Search via MySQL protocol
$manticore->query("CREATE TABLE IF NOT EXISTS rt(title text, body text, url text stored) html_strip='1' html_remove_elements='style,script,a' morphology='stem_en' index_sp='1'"); /* creating a table "rt" if it doesn't exist with the following settings:
- html_strip='1': stripping HTML is on
- html_remove_elements='style,script,a': for HTML tags <style>/<script>/<a> we don't need their contents, so we are stripping them completely
- morphology='stem_en': we'll use English stemmer as a morphology processor
- index_sp='1': we'll also index sentences and paragraphs for more advanced full-text search capabilities and better relevance
*/
while (!feof($f)) { # reading from STDIN while there's something
    $s = fgets($f); /* getting one line. Here is an example of what wget returns:
    2020-04-08 07:39:33 URL:https://www.who.int/westernpacific/ [98667/98667] -> "index.html.3" [1]
    which means that:
    - the original URL was https://www.who.int/westernpacific/
    - that it saved the contents to index.html.3
    */
    if (!preg_match('/URL:(?<url>http.*?) \[.*?\] -> "(?<path>.*?)"/', $s, $match)) continue; # if wget returns something else we just skip it, otherwise we use the regexp to put the url and the path into $match
    do { # wget may report a download before the file appears on disk, so we loop until we can read from the file:
        $content = @file_get_contents('./'.$match['path']); # reading from the file
        usleep(10000); # sleeping 10 milliseconds
    } while (!$content); # end the loop when we have the content
    if (preg_match('/<title>(?<title>.*?)<\/title>/is', $content, $content_match)) $title = trim(html_entity_decode($content_match['title'])); # here we are doing a simple HTML page parsing to get <title> from that
    else continue; # we are not interested in pages without a title
    echo "{$match['path']}: $title {$match['url']} ".strlen($content)." bytes\n"; # let's say something about our progress
    $manticore->query("REPLACE INTO rt (id,title,url,body) VALUES(".crc32($title).",'".$manticore->escape_string($title)."','".$manticore->escape_string($match['url'])."','".$manticore->escape_string($content)."')"); # and we are finally putting the contents to Manticore. We use crc32(title) as a document ID to avoid duplicates, so pages that share a title replace one another.
} # and we are going back to the next page wget reports as downloaded

So as soon as wget downloads something, it appears in Manticore and becomes searchable immediately. Your data collection will grow until wget can’t download anything else or until you stop the container.
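
One fragile spot in the script above is the loop that waits for the file: it keeps spinning if a reported file stays empty or never becomes readable. Below is a minimal sketch of a bounded variant. The helper name readWhenReady and the limit of 100 attempts are assumptions made for this illustration and are not part of the original script.

```php
<?php
// Hypothetical helper: waits for a file that wget has just reported, but gives
// up after $maxTries attempts instead of looping forever.
function readWhenReady(string $path, int $maxTries = 100, int $sleepMicros = 10000): ?string
{
    for ($i = 0; $i < $maxTries; $i++) {
        $content = @file_get_contents($path);
        if ($content !== false && $content !== '') {
            return $content; // the file exists and is not empty
        }
        usleep($sleepMicros); // 10 ms by default, as in load.php
    }
    return null; // the caller can skip this page
}

// Usage with a temporary file standing in for a page saved by wget.
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, '<title>Demo</title>');
var_dump(readWhenReady($tmp) !== null);              // the file is readable
var_dump(readWhenReady($tmp . '.missing', 3, 1000)); // no such file, so it gives up
unlink($tmp);
```

In load.php the do/while block could then become a single call followed by a continue when the result is null.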

Manticore Search

Another important component is Manticore Search.

Manticore is a lightweight database written in C++, created specifically for search, with powerful full-text search capabilities.

It can speak SQL over MySQL protocol as well as JSON over HTTP. What’s important for our purpose is that:

it can strip HTML
it has built-in NLP capabilities so we can split our texts into words, sentences and paragraphs efficiently and use stemmed forms of words (so e.g. “running” will find “run” etc.)
its official Docker image doesn’t require any configuration at all by default, so we can just use 2 SQL queries: one to create a new table and another to add a new document to it
it starts in milliseconds and is very cost-efficient in terms of RAM. No Java heap which takes all your memory or garbage collection which ruins your search performance
adding a document requires just one line of code to make an SQL query

So all we need to do to hook up Manticore in our case is these 3 lines in docker-compose.yml:

services:
 manticore:
   image: manticoresearch/manticore:3.4.0

Docker compose file

Docker Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.

Our docker-compose YAML looks like this:

version: '2.2'

services:
  # Manticore Search is a small yet powerful database for search with awesome full-text search capabilities
  manticore:
    # we'll just use their official image
    image: manticoresearch/manticore:3.4.0
    # and create a volume for data persistency
    volumes:
      - ./data:/var/lib/manticore
  # we also need php
  php:
    # which we'll build ourselves from Dockerfile
    build: php
    # no point to run the php container before manticore, hence the dependency
    depends_on:
      - manticore
    # the command below just runs wget to start crawling the domain passed in the env. variable
    # and lets the wget output flow to "php load.php" which inserts the pages into Manticore Search
    command: /bin/bash -c 'wget -nv -r -H -nd --connect-timeout=2 --read-timeout=10 --tries=1 --follow-tags=a -R "*.css*,*.js*,*.png,*.jpg,*.gif" "http://${domain}/" --domains=${domain} 2>&1 | php load.php'
  # let's also add a tiny php script to visualize what we have in Manticore
  web:
    # we'll use PHP 7.2 + Apache for that
    image: php:7.2-apache
    # it also depends on Manticore
    depends_on:
      - manticore
    # let's bind it to 8082 port locally
    ports:
      - 8082:80
    # we'll mirror folder "www" to /var/www/html/ inside the web server container so ./www/index.php will be the front page
    volumes:
      - ./www/:/var/www/html/

and there is also a Dockerfile for php with wget and the mysqli extension:

# Let's take php 7.4 as a base image
FROM php:7.4-cli
# We'll also install wget and PHP mysqli extension
RUN apt-get update \
&& apt-get -y install wget \
&& docker-php-source extract \
&& docker-php-ext-install mysqli \
&& docker-php-source delete
# We'll use load.php, so we need to copy it to the image
COPY load.php /usr/src/myapp/
# And let's change the working dir
WORKDIR /usr/src/myapp

Please go through the comments in both files. In a nutshell, the compose file includes 3 services:

manticore: just uses the official image
php: built from php/Dockerfile, which adds wget and the mysqli extension to the PHP image and copies the load.php script into it. Depends on manticore
web: from the official php+apache image. Depends on manticore

Feel free to change the port from 8082 to whatever you want. We also use the environment variable $domain to specify the domain to crawl. So when you run it like this:

domain=who.int docker-compose up

it runs the above 3 services and starts crawling:

snikolaev@dev:~/crawler$ domain=who.int docker-compose up
Starting crawler_manticore_1 … done
Recreating crawler_web_1 … done
Starting crawler_php_1 … done
...
php_1        | data.5: GHO https://www.who.int/data/gho 125537 bytes
php_1        | fact-sheets.4: Fact sheets https://www.who.int/news-room/fact-sheets 83345 bytes
php_1        | facts-in-pictures.3: Facts in pictures https://www.who.int/news-room/facts-in-pictures 70227 bytes
php_1        | publications.7: WHO | Publications https://www.who.int/publications/en/ 92069 bytes
php_1        | questions-answers.3: WHO | Online Q&A https://www.who.int/features/qa/en/ 78145 bytes
php_1        | popular.3: Health topics https://www.who.int/health-topics/ 123263 bytes
php_1        | ebola-virus-disease.8: Ebola virus disease https://www.who.int/health-topics/ebola/ 112116 bytes

Search bar

The last component we haven’t covered yet is index.php which runs when you open http://hostname:8082 (or another port if you changed it in the compose file). The full script is just 13 lines of code:

<form><h1>Manticore</h1><input name="search" type="text" style="width: 50%; border: 1px solid" value="<?=htmlspecialchars($_GET['search'] ?? '')?>"></form>
<hr>
<?php
if (isset($_GET['search'])) { # we have a search request, let's process it
    $ch = curl_init(); # initializing curl
    curl_setopt($ch, CURLOPT_URL,"http://manticore:9308/sql"); # we'll connect to Manticore's /sql endpoint via HTTP. There's also /json/search/ which gives much more granular control, but for the sake of simplicity we'll use the /sql endpoint
    curl_setopt($ch, CURLOPT_POST, 1); # we'll send via POST
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); # we need the response back, don't output it
    curl_setopt($ch, CURLOPT_POSTFIELDS, "mode=raw&query=SELECT url, highlight({}, 'title') title, highlight({}, 'body') body FROM rt WHERE MATCH('{$_GET['search']}') LIMIT 10"); /* here we are SELECTing :
 - url
 - highlighted title
 - highlighted body
 - from the index called "rt"
 - we want all documents that MATCH() our search query
 - and we need only the first 10, hence LIMIT 10
*/
    if ($json = json_decode(curl_exec($ch))) { # running the query and decoding the JSON
        foreach ($json->data as $result) echo "<small>{$result->url}</small><br><a href=\"{$result->url}\">{$result->title}</a><br>{$result->body}<br><br>"; # and here we just output the results: url, title and body
    }
}

Here, unlike load.php, we connect to Manticore over HTTP and use its /sql endpoint, which lets you send any SQL command over HTTP and returns JSON. In a production environment it might make more sense to use Manticore’s /json/search endpoint, which lets you break the request down into pieces much more granularly. That is often important if your search form is not just one text input but has multiple fields. But we don’t need all that now. The logic of the script is simple:

render a simple search form with a text input named “search”
if you press enter the form sends the typed value as an HTTP parameter “search”
then the script just takes the value and passes it to Manticore in a very compact and clear SQL query: SELECT url, highlight({}, 'title') title, highlight({}, 'body') body FROM rt WHERE MATCH('{$_GET['search']}') LIMIT 10
gets the results
and renders them as HTML

That’s it. Nothing complicated.
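
One caveat: the script puts the raw value of $_GET['search'] straight into the SQL string and into the POST body, so a query containing a single quote or an ampersand breaks the request. The sketch below shows one possible way to harden it, assuming that backslash-escaping is accepted inside the MATCH() string and that the /sql endpoint expects a URL-encoded body. The helper names escapeSqlString and buildSqlPostBody are hypothetical and not part of the original script.

```php
<?php
// Hypothetical helpers for index.php; the names are illustrative only.

// Escapes the characters that would end the SQL string literal early.
// Full-text operators are left alone so the extended query syntax still works.
function escapeSqlString(string $value): string
{
    return addcslashes($value, "\\'");
}

// Builds the URL-encoded POST body for Manticore's /sql endpoint.
function buildSqlPostBody(string $search): string
{
    $sql = "SELECT url, highlight({}, 'title') title, highlight({}, 'body') body "
         . "FROM rt WHERE MATCH('" . escapeSqlString($search) . "') LIMIT 10";
    return http_build_query(['mode' => 'raw', 'query' => $sql], '', '&');
}

$search = "o'brien & co";
echo htmlspecialchars($search, ENT_QUOTES), "\n"; // safe to print into value="..."
echo buildSqlPostBody($search), "\n";             // pass this to CURLOPT_POSTFIELDS
```

The same htmlspecialchars() call is what keeps the search box from echoing user input back as markup.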

What can it do?

Let’s now see what we can do with what we’ve built. Why didn’t we just dump wget output to files and use grep to search in them? Here is why:

Not only does our search engine find what matches your query, it also highlights the results and ranks them using an improved ranking formula similar to BM25. For example, as you can see in this picture, the results containing “IPC precaution recommendations” go first since they have the whole phrase:

Second, you can use Manticore’s extended query syntax to do many interesting things. For example you might want to find only those documents that have “covid” and “caught” in the same sentence or paragraph:

Or you can match by a whole phrase, use OR and NOT and many more.
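
As a rough illustration, here are a few search strings in the extended query syntax, wrapped into the kind of SELECT that index.php sends. The queries themselves can be typed directly into the search box. The helper searchSql is hypothetical and the example words are only placeholders.

```php
<?php
// Example search strings in Manticore's extended query syntax. The SENTENCE and
// PARAGRAPH operators rely on index_sp='1' from the CREATE TABLE statement.
$examples = [
    'same sentence'  => 'covid SENTENCE caught',
    'same paragraph' => 'covid PARAGRAPH caught',
    'exact phrase'   => '"IPC precaution recommendations"',
    'either word'    => 'covid | ebola',
    'exclude a word' => 'covid -ebola',
];

// Hypothetical helper: wraps a search string into a SELECT like the one in index.php.
function searchSql(string $query): string
{
    $escaped = addcslashes($query, "\\'"); // keep the SQL string literal intact
    return "SELECT url FROM rt WHERE MATCH('$escaped') LIMIT 10";
}

foreach ($examples as $label => $query) {
    echo str_pad($label, 16), searchSql($query), "\n";
}
```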

Third, do you remember that we turned on English stemming in CREATE TABLE? Here is how we can use it now: if I enter “coronaviruses”, it also finds pages that contain just “coronavirus”:

So even though the crawling part is very basic, the search part of our solution is quite powerful. You definitely can’t do anything like this with wget and grep alone.

How do I run it myself?

git clone https://github.com/manticoresoftware/demos.git manticore_demos
cd manticore_demos/crawler/
domain=who.int docker-compose up

If you run it for the first time, you’ll have to wait a few minutes for docker to download the images and build the php service. Afterwards it will start crawling http://who.int (or another domain you specify), and the search UI will be available at http://localhost:8082 unless you run it on a remote server.

Thanks for reading! You can access the code at https://github.com/manticoresoftware/demos.