[Explained] How Does A Search Engine Work?

Have you ever wondered how a search engine works? We use search engines as often as we call, text or drive, yet few of us know what happens behind the search box. This article explains the basic concepts of a search engine using simple and clear examples. A search engine consists of three important operations: web crawling, indexing and ranking.

Web Crawling

A web crawler, or web spider, is a piece of automated software that systematically browses the Internet (following fixed logic or probabilistic rules) and collects documents to be indexed later. A real-world analogy: you visit every bus stop (web page) in the city you live in (website) and take a picture of the bus schedule at each stop (gathering content for the index), then visit another city and do exactly the same thing.

A search engine such as Google has many web crawlers (known as Googlebots in their case), since there are billions and billions of pages on the Internet. Web crawling is a never-ending process because the Internet is always growing. Modern search engines also crawl other document and media types, not only web pages.

Going to every single bus stop in every major city would be a hectic task; therefore, web crawlers operate in parallel (while you are taking pictures of bus schedules in Toronto, your friend is doing exactly the same thing in Montreal at the same time).

The web crawler process typically begins with a list of URLs of web pages, usually generated from the previous crawl. The web crawler visits each of these web pages and detects links to other web pages.
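
A minimal sketch of this crawl loop may help, including the queueing of new links and the stopping rule that the article describes next. It is an illustration under stated assumptions, not how any real search engine is implemented: `FAKE_WEB` is a made-up in-memory stand-in for the Internet, the names `LinkParser` and `crawl` are invented for this example, and only a page budget (`max_pages`) is modelled, not a time limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the Internet: URL -> HTML content.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a> <a href="/c">C</a>',
    "http://example.com/c": "no links here",
}


def crawl(seed_urls, max_pages=1000):
    """Breadth-first crawl; returns {url: content} for the pages visited."""
    frontier = deque(seed_urls)  # list of pages still to crawl
    seen = set(seed_urls)        # avoids queueing the same URL twice
    collected = {}               # saved content, to be indexed later

    # Stop when nothing is left to crawl or the page budget is used up.
    while frontier and len(collected) < max_pages:
        url = frontier.popleft()
        html = FAKE_WEB.get(url)  # a real crawler would issue an HTTP request here
        if html is None:
            continue              # page could not be fetched
        collected[url] = html

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return collected


pages = crawl(["http://example.com/"], max_pages=3)
print(list(pages))
```

With a budget of three pages the crawl stops before reaching `/c`, even though a link to it was detected. The `seen` set is a design choice of this sketch: without it, pages that link to each other would be queued forever.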
These newly detected links are added to the list of pages to crawl. The crawler also saves the content of each web page so that it can be indexed later. The web crawler process ends when there are no more web pages to crawl or when an algorithmic condition is met. "Only crawl 1,000 web pages in the next 24 hours" is an example of an algorithmic condition.

Indexing

Once the web crawler has completed the content gathering process, the index table needs to be created or updated. An index table is used because it makes returning search results to the user much faster. Creating or updating an index table is usually a long process; however, this is acceptable since the process is hidden from the user. The major steps of building an index are:

1. Collect the documents returned from the web crawler. Suppose the web crawler only returned the following documents:

2. Remove stop words and punctuation marks from the documents. Stop words are extremely common words in the English language like “a”, “the” and “or”. These words are removed in order to improve the efficiency of the search engine when returning results.
The words that remain are then treated as a "bag of words": an unordered collection in which only the presence and count of each word matter.

3. Additional linguistic processing is completed by converting each word to its root word. For example, “climbing” to “climb”, “friends” to “friend” or “children” to “child”. This process is called “stemming” (handling irregular forms such as “children” with a dictionary is more precisely called lemmatization).

4. Create an index of terms that records, for each term, the documents in which it occurs and how often it occurs in each. Below is a sample of the index based on the content above:

This example is a very simple indexing method. Search engines today use techniques that are more complex. The frequency of a term in a document is an important property; however, other properties (such as the position of a term in a document, the geographical location of the server hosting the document, or the age of the content) can also be added to the index table.

Ranking

The document ranking process occurs once the user enters words (the query) into the search engine and submits them. Suppose the user searched for the query “I love humpty chips”. The search engine returns documents D1 and D3, but which document is more relevant to the user? This is not an issue for this user, since the document set is small enough to read in full. But what if it is not? Search engines today rank documents using sophisticated techniques based on many factors (the secret sauce), such as the frequency property in our previous example.

Reverse engineering those factors is what some tech and marketing professionals call Search Engine Optimization (SEO). The goal of an SEO professional is to exploit those factors, if discovered, to improve the ranking of a web page. Unfortunately for them, search engine companies are always improving and refining their ranking algorithms. This in turn makes knowledge gained in the SEO industry obsolete.

Further Readings

Here is a selected list of readings if you would like to learn more about how a search engine works:

1. The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Larry Page (Google founders)
2. Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
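
As a worked illustration of the four indexing steps in the Indexing section, here is a small sketch. Everything in it is an assumption made for the example: the three documents are invented (they are not the article's original sample documents), the stop-word list is deliberately tiny, and `stem` is a naive stand-in for a real stemming algorithm such as the Porter stemmer.

```python
import re
from collections import Counter, defaultdict

# Step 1: hypothetical sample documents, invented for this sketch.
DOCUMENTS = {
    "D1": "I love chips!",
    "D2": "The children are climbing a tree.",
    "D3": "Humpty chips are the best chips, friends.",
}

STOP_WORDS = {"a", "an", "and", "are", "i", "or", "the"}  # tiny illustrative list

IRREGULAR = {"children": "child"}  # irregular forms need a lookup table


def stem(word):
    """Very naive root-word reduction; real systems use a proper stemmer."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def tokenize(text):
    """Steps 2 and 3: lowercase, drop punctuation and stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(documents):
    """Step 4: map each term to {document id: frequency of the term in it}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return dict(index)


index = build_index(DOCUMENTS)
for term in sorted(index):
    print(term, index[term])
```

Keying the table by term rather than by document is what makes lookups fast: answering a query only requires reading the entries for the query's terms, not scanning every document.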

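
To illustrate the Ranking section, the sketch below scores documents using nothing but term frequency. This is an assumption-laden toy: the `INDEX` table is hypothetical, the query terms are assumed to have already gone through stop-word removal and stemming, and real search engines combine many more factors than the single one used here.

```python
# Hypothetical index: term -> {document id: term frequency}.
INDEX = {
    "love": {"D1": 1},
    "chip": {"D1": 1, "D3": 2},
    "humpty": {"D3": 1},
    "tree": {"D2": 1},
}


def rank(query_terms, index):
    """Score each document by the summed frequency of the query terms it contains."""
    scores = {}
    for term in query_terms:
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + freq
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# "I love humpty chips" after the assumed preprocessing.
print(rank(["love", "humpty", "chip"], INDEX))
```

With this made-up index, only D1 and D3 contain any query term, and D3 comes first because it matches more occurrences. Documents with no matching term, such as D2, never enter the result list.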