Database Indexing is the most common way known and utilized by backend developers to optimize database queries. In this article, we will be discussing about database indexing in detail.
A database index allows a query to retrieve data from a database in an efficient manner. In simpler words, indexing is a way to get an unordered table into an order that will maximize efficiency when searching for a record.
Indexes can be related to specific tables and consist of one or more keys. Also, a table can have multiple indexes built from it.
When a database table is unindexed, there won’t be any clear order of the rows, thus, to fulfill any query, it will need to search through the rows linearly, that is, the query will have to search through each row to find the rows with the matching condition. As you can imagine, this isn’t ideal and can be a problem when looking inside a database table with huge amount of data.
For example, we have a table as shown below:
COMPANY_ID |
UNIT |
UNIT_COST |
---|---|---|
10 |
12 |
1.15 |
12 |
12 |
1.05 |
14 |
18 |
1.31 |
18 |
18 |
1.34 |
11 |
24 |
1.15 |
16 |
12 |
1.31 |
10 |
12 |
1.15 |
12 |
24 |
1.3 |
18 |
6 |
1.34 |
18 |
12 |
1.35 |
14 |
12 |
1.95 |
21 |
18 |
1.36 |
12 |
12 |
1.05 |
20 |
6 |
1.31 |
18 |
18 |
1.34 |
11 |
24 |
1.15 |
14 |
24 |
1.05 |
And then, we want to run a query as following:
SELECT
company_id,
units,
unit_cost
FROM
index_test
WHERE
company_id = 18
In this particular case, the database would have to search through all 17 records in the order they appear in the table, from top to bottom, one at a time, to look for all the potential instances of company_id
as 18.
This will only get more and more time consuming as the size of the table increases. How indexing can help here? Indexing can help us set up the column with the search condition on (company_id
in this case) in a sorted manner to optimize the query performance.
With an index on the company_id
column, the table would look like this:
COMPANY_ID |
UNIT |
UNIT_COST |
---|---|---|
10 |
12 |
1.15 |
10 |
12 |
1.15 |
11 |
24 |
1.15 |
11 |
24 |
1.15 |
12 |
12 |
1.05 |
12 |
24 |
1.3 |
12 |
12 |
1.05 |
14 |
18 |
1.31 |
14 |
12 |
1.95 |
14 |
24 |
1.05 |
16 |
12 |
1.31 |
18 |
18 |
1.34 |
18 |
6 |
1.34 |
18 |
12 |
1.35 |
18 |
18 |
1.34 |
20 |
6 |
1.31 |
21 |
18 |
1.36 |
Now, the database can simply search for company_id
equal to 18 and return all the requested columns for that row, and then move to the next row. If the next row also has the company_id
as 18 again, then it will also return the request columns for this row, but if the next row has the company_id
as 18, the database knows that it can stop the search here, and finish the response.
This was a rather simple explanation of what database indexes are and what they can do, but there’s a lot more going on in the process. Let’s take a deeper look into how indexing works.
In reality, the database table doesn’t reorder itself every time the query conditions alter in order to optimize the database performance but actually happens is that the index makes the database create a separate data structure which should be easily sortable.
It is important to note that when an index is created on a column in a database, it creates a data structure on that specific column and no other column is stored in this data structure. For instance, in the above example, our data structure will only contain the company_id
and no other columns such as unit
or unit_cost
.
But a legit question pops up here - how does the database know what other fields in the table are to be returned for a query. Let’s try to understand how.
Database indexes store pointers to simply reference information for the location of the additional information in the memory. In other words, the index holds the company_id
and that particular row’s address in the memory. In this example, the database index will look something like this:
COMPANY_ID |
POINTER |
---|---|
10 |
_123 |
10 |
_129 |
11 |
_127 |
11 |
_138 |
12 |
_124 |
12 |
_130 |
12 |
_135 |
14 |
_125 |
14 |
_131 |
14 |
_133 |
16 |
_128 |
18 |
_126 |
18 |
_131 |
18 |
_132 |
18 |
_137 |
20 |
_136 |
21 |
_134 |
With this index, the query can check for the rows in the company_id
column which have 18 as a value and then using the pointer, it can find the related information for that record.
Having understood what we expect from the index, let’s have a look at the common data structures that can be used for database indexing:
B-trees are the most often used index data structures because they are fast for lookups, deletions, and insertions. All of these operations are possible in logarithmic time and the data contained within a B-tree can be sorted easily.
Hash indexes are commonly used to describe indexes that utilize hash tables. Because hash tables are particularly efficient at looking up data, queries that look for an exact match may be processed rapidly. The key in a hash index is the column value, and the value in a hash table is a reference to the table's row data.
Hash tables, on the other hand, are not ordered data structures; therefore, they may be inefficient for other types of searches.
R-tree is frequently used in spatial databases, usually used to index multi-dimensional information such as geographical coordinates, rectangles, polygons, etc. It is useful for searches such as "find all the coffee shops within 2 miles of my location."
Bitmap indexes are useful for columns that have a high number of occurrences of such values, i.e. columns with low selectivity. For instance, consider a column having boolean values.
Indexes are designed to increase database performance; thus, indexing can be used whenever we need to significantly improve database performance. The greater your database expands, the more probable it is that indexing will benefit you.
However, the first and foremost thing to remember is that the index takes up extra space; therefore, the larger the table, the greater the index. Every time you perform an add, remove, or update operation, the same operation will need to be executed on the index as well.
When data is written to the database, the original table is updated first, followed by other indexes based on that table. When a write is done to the database, the indexes become inoperable until they are updated. The indexes will never be functional if the database is continually getting writes.
This is why indexes are often applied to databases in data warehouses that get new data on a planned basis (during off-peak hours) rather than production databases that may receive new writes all the time.
The following code snippet shows how to create an index on a single column in a SQL database:
CREATE INDEX name_index ON Employee (Employee_Name);
If you want to create an index on multiple columns, the SQL command will look something like this:
CREATE INDEX name_index ON Employee (Employee_Name, Employee_Age);
In general, an index should be constructed on a table only if the data in the indexed column will be frequently accessed.
So, we discussed database indexing in detail in this article and also learned about the data structures used to implement database indexing and also when it is advisable to use indexes and otherwise.
To sum everything up, here is a quick summary:
This is all for this article. Database Indexing is a vast and a bit complicated topic, I hope this article would be helpful in understanding the basics of the concept.
Keep reading!