[Explained] How Does A Search Engine Work?
Technical Expert in the Public Sector.
Have you ever wondered how a search engine works? We use search engines as often as we call, text or drive, yet few of us know what happens behind the search box. The purpose of this article is to explain the basic concepts of a search engine using simple and clear examples.
A search engine consists of three important operations: crawling, indexing and ranking.
A web crawler or web spider is a piece of automated
software that systematically (based on logic or probabilistic rules) browses the Internet and
collects information (documents) to be used for the indexing process in the future. A real
world analogy of this would be you visiting every bus stop (web page) in the
city you live in (website) and taking a picture of the bus schedule at
each stop (gathering content for the index), and then visiting another city and doing the same exact thing.
A search engine such as
Google has many web crawlers (known as Googlebots
in their case) since there are
billions and billions of pages on the Internet. Web crawling is a never-ending
process since the Internet is always growing. Modern search engines also
crawl other document and media types, not just web pages.
Going to every single bus stop in every major city would be a
daunting task; therefore, web crawlers operate in parallel (while you are taking pictures of bus schedules in Toronto, your friend is doing the same exact thing in Montreal at the same exact time).
The web crawler process typically begins with a list of URLs
of web pages usually generated from the previous web crawl process. The web
crawler visits each of these web pages and detects links to other web pages.
These newly detected links are added to the list of pages to crawl. The crawler
also saves the content of each web page so it can be indexed later. The web
crawler process ends when there are no more web pages to crawl or when an
algorithmic condition is met. "Only crawl 1,000 web pages in the next 24 hours" is an example of such a condition.
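The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: `fetch_page` is a hypothetical stand-in for a real HTTP client and HTML link extractor, and the `max_pages` limit plays the role of the algorithmic stop condition.

```python
from collections import deque

def crawl(seed_urls, fetch_page, max_pages=1000):
    """Breadth-first crawl: visit pages, save content, queue newly found links.

    fetch_page(url) is assumed to return (page_content, list_of_links).
    """
    to_crawl = deque(seed_urls)   # list of URLs still to visit
    seen = set(seed_urls)         # avoid crawling the same page twice
    saved = {}                    # url -> content, kept for the indexer
    while to_crawl and len(saved) < max_pages:  # algorithmic stop condition
        url = to_crawl.popleft()
        content, links = fetch_page(url)
        saved[url] = content                    # save content for indexing later
        for link in links:                      # newly detected links join the queue
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return saved
```

Parallelism (the Toronto/Montreal analogy) would amount to several such loops sharing the `to_crawl` queue and `seen` set.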
Once the web crawler has completed the content gathering
process, the index table needs to be created or updated. An index table
is used because of the speed benefits it provides when returning search engine
results to the user. Creating or updating an index table is
usually a long process; however, this is acceptable since the process is hidden
from the user.
The major steps of building an index are:
1. Gather the documents returned from the web crawler. Suppose the web
crawler only returned the following documents:
2. Remove stop words and punctuation marks from the documents. Stop words are extremely
common words in the English language like “a”, “the” and “or”. These words are
removed in order to improve the efficiency of the search engine when returning results.
The unordered words that remain are often described as a “bag of words”.
3. Complete further linguistic processing by converting each word to its root word.
For example, “climbing” to “climb”, “friends” to “friend” or “children” to
“child”. This process is called “stemming”.
4. Create an index of terms that records, for each term, the documents it appears in and the frequency with which it occurs. Below is a sample of the index based on the content above:
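The four steps above can be sketched in Python. This is only an illustration under simplifying assumptions: the stop-word list is tiny, and the `stem` function is a deliberately crude suffix-stripper (real engines use proper algorithms such as the Porter stemmer, which correctly handles irregular cases like “children”).

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"a", "the", "or", "and", "i"}  # tiny illustrative list

def stem(word):
    """Crude suffix-stripping stemmer, for illustration only."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(documents):
    """documents: {doc_id: text}. Returns {term: {doc_id: frequency}}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Steps 1-2: tokenize, dropping punctuation and stop words
        words = [w for w in re.findall(r"[a-z]+", text.lower())
                 if w not in STOP_WORDS]
        # Step 3: reduce each word to its root
        terms = [stem(w) for w in words]
        # Step 4: record per-document term frequencies
        for term, freq in Counter(terms).items():
            index[term][doc_id] = freq
    return index
```

For example, `build_index({"D1": "I love humpty chips."})` maps the term “chip” to `{"D1": 1}`, with the stop word “I” and the punctuation dropped.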
This example is a very simple indexing method. Search engines today use more complex techniques. The frequency of a term in a document is an important property; however, other properties (such as the position of a term in a document, the geographical location of the server hosting the document, or the age of the content) can also be added to the index table.
The document ranking process occurs once the user enters words (the query) into the search engine and submits the search. Suppose the user searched the query “I love humpty chips”. Obviously, the search engine returns Documents D1 and D3, but which document is more relevant to the user? With a document set this small the question hardly matters, but what happens when millions of documents match?
Search engines today rank documents using sophisticated techniques based on many factors (the secret sauce), such as the frequency property in our previous example. Reverse engineering what those factors are is what some tech and marketing professionals call Search Engine Optimization (SEO).
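A minimal sketch of ranking by the frequency property alone might look like the following; real engines combine many more signals, but this shows the basic idea of scoring and sorting matching documents. It assumes an index in the `{term: {doc_id: frequency}}` shape described earlier.

```python
def rank(query_terms, index):
    """Score each document by the summed frequency of the query terms it contains,
    then return (doc_id, score) pairs, most relevant first."""
    scores = {}
    for term in query_terms:
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A document that mentions the query terms more often accumulates a higher score and rises to the top of the results.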
The goal of an SEO professional is to exploit those factors, if discovered, to improve the ranking of a web page. However, search engine companies are constantly improving and refining their ranking algorithms, which in turn makes knowledge gained in the SEO industry obsolete.
Here is a selected list of further readings if you would like to learn more about how a search engine works: