Have you ever wondered how a search engine works? We use search engines as often as we call, text, or drive. But how do they work? The purpose of this article is to explain the basic concepts of a search engine using simple, clear examples.
A search engine consists of three important operations: web crawling, indexing, and ranking.
A web crawler or web spider is a piece of automated software that systematically (based on logical or probabilistic rules) browses the Internet and collects information (documents) to be used later in the indexing process. A real-world analogy would be you visiting every bus stop (web page) in the city you live in (website), taking a picture of the bus schedule at each stop (gathering content for the index), and then visiting another city and doing the same exact thing.
A search engine such as Google has many web crawlers (known as Googlebots in their case) since there are billions and billions of pages on the Internet. Web crawling is a never-ending process since the Internet is always growing. Modern day search engines also crawl other document and media types and not only web pages.
Going to every single bus stop in every major city would be a hectic task and therefore, web crawlers operate in parallel (while you are taking pictures of bus schedules in Toronto, your friend is doing the same exact thing in Montreal at the same exact time).
The web crawler process typically begins with a list of URLs of web pages usually generated from the previous web crawl process. The web crawler visits each of these web pages and detects links to other web pages. These newly detected links are added to the list of pages to crawl.
The crawler also saves the content of each web page so it can be indexed later. The web crawler process ends when there are no more web pages to crawl or when an algorithmic condition is met. "Only crawl 1,000 web pages in the next 24 hours" is an example of an algorithmic condition.
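The crawling process described above can be sketched as a breadth-first traversal: keep a queue of URLs to visit, save each page's content, and add any newly discovered links to the queue. The sketch below simulates the web with a small hardcoded dictionary (a real crawler would issue HTTP requests), and the page names are made up for illustration.

```python
from collections import deque

# A toy "web": URL -> (page content, outgoing links). These pages are
# invented for the example; a real crawler would fetch them over HTTP.
WEB = {
    "a.com": ("welcome to a", ["b.com", "c.com"]),
    "b.com": ("hello from b", ["a.com"]),
    "c.com": ("c links to d", ["d.com"]),
    "d.com": ("the end", []),
}

def crawl(seed_urls, max_pages=1000):
    """Breadth-first crawl: visit pages, save content, queue new links."""
    frontier = deque(seed_urls)  # URLs waiting to be crawled
    seen = set(seed_urls)        # URLs already queued, to avoid revisits
    saved = {}                   # URL -> content, kept for later indexing
    while frontier and len(saved) < max_pages:  # algorithmic stop condition
        url = frontier.popleft()
        content, links = WEB.get(url, ("", []))  # "fetch" the page
        saved[url] = content
        for link in links:                       # detect links to new pages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return saved

pages = crawl(["a.com"])
# All four pages are reachable from the seed, so all four are saved.
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other, and `max_pages` plays the role of the "only crawl 1,000 pages" condition above.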
Once the web crawler has completed the content gathering process, the index table needs to be created or updated. An index table is used because of the speed benefits it provides when returning search engine results to the user. The process of creating or updating of an index table is usually a long process; however, this is acceptable since the process is hidden from the user.
The major steps of building an index are:
1. Collect the documents returned from the web crawler. Suppose the web crawler only returned the following documents:
2. Remove stop words and punctuation marks from the documents. Stop words are extremely common words in the English language like “a”, “the” and “or”. These words are removed in order to improve the efficiency of the search engine when returning results. The words that remain are often treated as an unordered “bag of words”.
3. Additional linguistic processing is completed by converting each word to its root word. For example, “climbing” to “climb” or “friends” to “friend” or “children” to “child”. This process is called “stemming”.
4. Create an index of terms that records, for each term, the documents it appears in and the frequency with which it occurs. Below is a sample of the index based on the content above:
This example is a very simple indexing method. Search engines today use techniques that are more complex. The frequency of a term in a document is an important property; however, other properties (such as the position of a term in a document, the geographical location of the server hosting the document, or the age of the content) can also be added to the index table.
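The indexing steps above can be sketched end to end: strip punctuation and stop words, stem each remaining term, and record per-document frequencies in an inverted index. The stop-word list and the suffix-stripping stemmer below are toy stand-ins chosen for illustration; real engines use much larger stop lists and algorithms such as the Porter stemmer.

```python
import string
from collections import Counter, defaultdict

# A tiny illustrative stop-word list; real engines use much longer ones.
STOP_WORDS = {"a", "the", "or", "and", "i", "are"}

# Toy stemmer: a couple of suffix rules plus one irregular form.
IRREGULAR = {"children": "child"}

def stem(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(documents):
    """documents: {doc_id: text} -> inverted index {term: {doc_id: freq}}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Step 2: lowercase, strip punctuation, drop stop words.
        cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
        terms = [w for w in cleaned.split() if w not in STOP_WORDS]
        # Step 3: reduce each remaining term to its root.
        terms = [stem(w) for w in terms]
        # Step 4: record how often each term occurs in this document.
        for term, freq in Counter(terms).items():
            index[term][doc_id] = freq
    return dict(index)

docs = {"D1": "I love humpty chips.", "D2": "The children are climbing."}
index = build_index(docs)
# "climbing" is stemmed to "climb", "children" to "child", and the stop
# words "I", "the", and "are" never reach the index.
```

Looking a term up in this dictionary is a constant-time operation, which is exactly the speed benefit an index table provides over scanning every document at query time.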
The document ranking process occurs once the user enters words (the query) into the search engine and submits the search. Suppose the user searched the following query: “I love humpty chips”. Obviously, the search engine returns Documents D1 and D3, but which document is more relevant to the user? With a document set this small, ranking hardly matters, but what happens when the set contains millions of documents?
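A minimal way to answer "which document is more relevant?" is to score each document by the total frequency of the query terms it contains and sort the results. The sketch below does exactly that; the index entries and their frequencies are hypothetical, since the article's sample documents are not reproduced here, and real engines combine many more signals than raw frequency.

```python
# A tiny inverted index: term -> {doc_id: frequency}. The frequencies
# here are invented for the example.
INDEX = {
    "love":   {"D1": 2, "D3": 1},
    "humpty": {"D1": 1, "D3": 3},
    "chip":   {"D3": 1},
}

def rank(query_terms, index):
    """Score each document by the summed frequency of the query terms it
    contains, then return the documents sorted highest-scoring first."""
    scores = {}
    for term in query_terms:
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + freq
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

results = rank(["love", "humpty", "chip"], INDEX)
# D3 scores 1 + 3 + 1 = 5 and D1 scores 2 + 1 = 3, so D3 ranks first.
```

Under this (deliberately naive) scheme, D3 would be shown above D1 because the query terms occur in it more often.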
Search engines today rank documents using sophisticated techniques based on many factors (the "secret sauce"), such as the frequency property in our previous example. Reverse engineering what those factors are is what some tech and marketing professionals call Search Engine Optimization (SEO).
The goal of an SEO professional is to exploit those factors, if discovered, to improve the ranking of a web page. However, search engine companies are always improving and refining their ranking algorithms, which in turn makes knowledge gained in the SEO industry obsolete.
Here is a selected list of further reading if you would like to learn more about how a search engine works:
1. The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Larry Page (Google Founders)
2. Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.