How do sites like Google and Facebook search large databases in seconds?

If data is the new oil, so are databases the new oil refineries — Aditya Chandupatla

Having been working as a software engineer in the database field for quite some time now, I can assure you that the time database engineers spend in squeezing out the last drop of efficiency is nothing short of remarkable.

I will mention few of the database internal features which are responsible for speeding up the search process:

1. Use of sophisticated data structures. This is the most fundamental thing to get it right. The data is not stored sequentially, they are stored in the form of trees (most likely B-Tree’s, or some variant of it.) However, it is the most easiest thing to do. In fact, no one ever re-writes the industry standards of data structures used in databases, unless they really really need to — which is highly unlikely.

2. Data is distributed. Using sophisticated data structures alone doesn’t cut the mustard for the big-shots (like Google and Facebook) which manage petabytes of data. Hence they distribute. Distributing the data not only allows for efficient search but it also serves as an extraordinary fault tolerance mechanism (in the event of a disaster.) You might wonder how does distributing data helps in search process. To illustrate, consider an example of a hash function [math]f(x)[/math] which takes an input row and produces a unique integer output value in the range [math][a, b][/math]. The input can be virtually anything — because it is the user who is providing it. Whereas, the output is determined by the hash function which produces discrete integers. Each integer is then associated with a separate-disk. In other words, each input row gets stored in a particular disk.

Teradata Block Diagram

(AMP stands for Access Module Processor which is like a virtual process. VDISK is a virtual disk)

Therefore, when the time comes for retrieval, only the process associated with that particular disk will function and the others will not. (Assuming that each disk has a process associated with it. This architecture is commonly referred to as Shared-Nothing Architecture) This idea has been so successful that it has been applied in several varieties such as follows:

Node level distribution: Database, here, is not a process in a single machine, but a set of systems. For instance, we often hear the term 3-node database.

Multi-Node Database

Distribution within a database: I have explained this example above while I was describing the use of hash function.
Distribution inside a table: This technique is often referred to as table-partitioning. Similar to the above two approaches, the data is partitioned. But here, it is even more fine-grained — at the level of the table.

Table-partitioning

3. Temperature of the data. Depending upon the frequency of the access, the data can be classified, broadly, into 3 categories: Hot, Warm and Cold. Each of which is stored in disks which offer varying retrieval speeds.

For a practical example, your most recent Facebook messages on messenger from last night, will be considered as hot data and be stored in SSD’s. On the other hand, the profile picture of all your friends whom you are stalking (:P) will be considered as warm. Finally, the cover photo on your wall, which you never update, and most likely, never access, is treated as cold.

4. Efficiency. Search is not completely predicated on retrieval speeds and storage architectures alone. Sometimes, the bottleneck might be the CPU and not the I/O. Therefore, we see databases offering features such as efficient table-joins, and GPU-based databases, to mention a few.

Lastly, it is vital to remember that optimising the database alone doesn’t bring you all the efficiency you need. The application which interfaces with the database and serves the user must also be equally optimised. Things such as web-cache, machine learning based predictions by taking user’s browser-history, etc., come into the picture.

So, whenever you think of companies such as Google and Facebook, which inarguably have huge amounts of user data, don’t assume that they are going to search through each and every row in their database to get you the results. Besides, I highly reckon, that the techniques mentioned here are only the tip of the ice-berg which are considered to be industry-standards. There is a lot more stuff going on under the hood!

Originally published at www.quora.com.