Database performance depends heavily on indexing, and indeed it is no exaggeration to say that bad indexing is the main source of bad performance.
Yes, it’s important to adjust memory parameters and all that, but it is all in vain if indexes are not used properly.
There is simply no replacement for a missing index, and there is no way to achieve good performance without proper indexing.
For data that can be sorted (using the <, <=, =, >=, and > operators), B-tree indexes generally work well. Sometimes, however, B-trees are just not enough.
This is because not every data type can be sorted in a useful way. Just imagine a polygon. How would you sort these objects in a useful way? Sure, you can sort by the area covered, its length, and so on, but doing this won't allow you to actually find them using a geometric search.
The solution to this problem is to provide more than just one index type. Each index type serves a special purpose and does exactly what is needed. PostgreSQL ships with six index types out of the box. Note that the pg_am listing below (taken from a recent PostgreSQL version) also shows heap, which is the default table access method (amtype = 't'), not an index type:
test=# SELECT * FROM pg_am;
 oid  | amname |      amhandler       | amtype
------+--------+----------------------+--------
    2 | heap   | heap_tableam_handler | t
  403 | btree  | bthandler            | i
  405 | hash   | hashhandler          | i
  783 | gist   | gisthandler          | i
 2742 | gin    | ginhandler           | i
 4000 | spgist | spghandler           | i
 3580 | brin   | brinhandler          | i
(7 rows)
I’ll assume you already know about B-trees. (If not, you could check out the resources mentioned in Further reading, below.) But when should one contemplate using these other index types? The following guide outlines the purpose of each index type available in PostgreSQL.
Note that there are some extensions out there that can be used on top of what you can see here. Additional index types that are available on the web include rum, vodka, and, in the future, cognac.
Getting started
Before going further, let’s create some test data to refer to later. Here’s the code snippet to do that:
test=# DROP TABLE IF EXISTS t_test;
DROP TABLE
test=# CREATE TABLE t_test (id serial, name text);
CREATE TABLE
test=# INSERT INTO t_test (name) SELECT 'hans'
FROM generate_series(1, 2000000);
INSERT 0 2000000
test=# INSERT INTO t_test (name) SELECT 'paul'
FROM generate_series(1, 2000000);
INSERT 0 2000000
After dropping any leftover table, a simple table is created with two columns: the first is an auto-increment column that just keeps creating numbers, and the second is a column that will be filled with static values. In all, 4 million rows have been added:
test=# SELECT name, count(*) FROM t_test GROUP BY 1;
name | count
---------+---------
hans | 2000000
paul | 2000000
(2 rows)
Hash indexes have been around for many years. The idea is to hash the input value and store it for later lookups. Having hash indexes makes sense, but before PostgreSQL 10.0 it was not advisable to use hash indexes because PostgreSQL had no WAL (write-ahead logging) support for them.
In PostgreSQL 10.0, this has changed. Hash indexes are now fully logged and are therefore ready for replication and considered to be 100% crash-safe.
Hash indexes are generally a bit larger than B-tree indexes. Suppose you want to index 4 million integer values. A B-tree will need around 90 MB of storage to do this.
A hash index will need around 125 MB on disk. The assumption that's made by many people is that a hash is super-small on disk, but in many cases that assumption is wrong.
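As a rough sketch, you can verify this yourself on the t_test table created earlier. The index names here are made up for illustration, and the exact sizes you see will vary with your PostgreSQL version and settings:

```sql
-- Compare the on-disk size of a B-tree and a hash index
-- built over the same 4 million integer values:
CREATE INDEX idx_id_btree ON t_test USING btree (id);
CREATE INDEX idx_id_hash  ON t_test USING hash  (id);

SELECT relname, pg_size_pretty(pg_relation_size(oid))
FROM   pg_class
WHERE  relname IN ('idx_id_btree', 'idx_id_hash');
```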
Generalized Search Tree (GiST) indexes are an important index type because they can be applied to a variety of different purposes. GiST indexes can be used to implement R-tree behavior (see https://en.wikipedia.org/wiki/R-tree), and it is even possible for them to act as B-trees. However, abusing GiST for B-tree indexes is not recommended.
Typical use cases for GiST include geometric data, range types, and fuzzy text search (for example, via the pg_trgm extension).
To many people, GiST is still a black box, so let’s try to outline how GiST works internally. Consider the following diagram:
Take a look at the tree. You can see that R1 and R2 are on top. R1 and R2 are the bounding boxes that contain everything else. R3, R4, and R5 are contained by R1. R8, R9, and R10 are contained by R3, and so on.
A GiST index is therefore hierarchically organized. As the diagram shows, GiST supports operations that aren't available in B-trees, such as overlaps, left of, and right of. This layout makes a GiST tree ideal for geometric indexing.
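To see this in action, here is a small sketch using the built-in box type, whose && (overlaps) operator is supported by GiST. The table and index names are made up for illustration:

```sql
-- A table of rectangles; each row stores one bounding box.
CREATE TABLE t_shape (id serial, b box);
INSERT INTO t_shape (b)
SELECT box(point(x, x), point(x + 1, x + 1))
FROM   generate_series(1, 100000) AS x;

CREATE INDEX idx_shape ON t_shape USING gist (b);

-- The overlap operator (&&) can use the GiST index:
EXPLAIN SELECT * FROM t_shape
WHERE  b && box(point(5, 5), point(6, 6));
```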
Of course, it is also possible to come up with your own operator classes. The following strategies are supported:
Operation                   | Strategy number
----------------------------+----------------
Strictly left of            |  1
Does not extend to right of |  2
Overlaps                    |  3
Does not extend to left of  |  4
Strictly right of           |  5
Same                        |  6
Contains                    |  7
Contained by                |  8
Does not extend above       |  9
Strictly below              | 10
Strictly above              | 11
Does not extend below       | 12
If you want to write operator classes for GiST, a couple of support functions have to be provided. In the case of a B-tree, there is only one comparison function. GiST requires considerably more: consistent, union, compress, decompress, penalty, picksplit, and same, plus optional functions such as distance and fetch.
Generalized Inverted Index (GIN) indexes are a good way to index text. Suppose you want to index one million text documents. A certain word may occur millions of times. In a normal B-tree, this would mean that the key is stored millions of times.
This is not the case in GIN. Each key (or word) is stored once and assigned to a document list. Keys are organized in a standard B-tree.
Each entry will have a document list pointing to all the entries in the table that have the same key. A GIN index is very compact. However, it lacks an important feature found in B-trees — sorted data. In GIN, the list of item pointers associated with a certain key is sorted by the position of the row in the table, and not by arbitrary criteria.
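A common way to exercise this is full-text search, where to_tsvector splits each document into words and GIN stores each word once with its posting list. A minimal sketch (table and index names are made up for illustration):

```sql
-- A small documents table:
CREATE TABLE t_doc (id serial, payload text);
INSERT INTO t_doc (payload)
VALUES ('the quick brown fox'), ('the lazy dog');

-- GIN over the tsvector of each document:
CREATE INDEX idx_doc
    ON t_doc USING gin (to_tsvector('english', payload));

-- The @@ operator matches a tsquery against the indexed tsvector:
SELECT * FROM t_doc
WHERE  to_tsvector('english', payload)
       @@ to_tsquery('english', 'fox');
```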
Just like any other index, GIN can be extended. The following strategies are available:
Operation       | Strategy number
----------------+----------------
Overlap         | 1
Contains        | 2
Is contained by | 3
Equal           | 4
On top of this, the following support functions are available: compare, extractValue, extractQuery, and consistent, plus the optional comparePartial and triConsistent functions.
If you are looking for a good example of how to extend GIN, consider looking at the btree_gin module in the PostgreSQL contrib directory. It is a valuable source of information and a good way for you to start your own implementation.
A Space-Partitioned GiST (SP-GiST) is mainly designed for in-memory use. The reason for this is that an SP-GiST stored on disk needs a fairly high number of disk hits to function. Disk hits are way more expensive than just following a couple of pointers in the RAM.
The beauty is that SP-GiST can be used to implement various types of trees, such as quadtrees, k-d trees, and radix trees (tries).
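For instance, SP-GiST's default operator class for the point type implements a quadtree. A small sketch (names are made up for illustration):

```sql
-- A table of random two-dimensional points:
CREATE TABLE t_point (p point);
INSERT INTO t_point
SELECT point(random() * 1000, random() * 1000)
FROM   generate_series(1, 100000);

-- SP-GiST builds a quadtree over the points:
CREATE INDEX idx_point ON t_point USING spgist (p);

-- The containment operator (<@) can use the quadtree:
EXPLAIN SELECT * FROM t_point
WHERE  p <@ box(point(0, 0), point(10, 10));
```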
The following strategies are provided:
Operation         | Strategy number
------------------+----------------
Strictly left of  |  1
Strictly right of |  5
Same              |  6
Contained by      |  8
Strictly below    | 10
Strictly above    | 11
To write your own operator classes for SP-GiST, a couple of functions have to be provided: config, choose, picksplit, inner_consistent, and leaf_consistent.
Block Range Indexes (BRINs) are of great practical use. All the indexes we've discussed up until now need quite a lot of disk space. Although a lot of work has gone into shrinking GIN indexes and the like, they still need quite a lot of disk space because an index pointer is needed for each entry.
So if there are 10 million entries, there will be 10 million index pointers. Space is the main concern addressed by BRINs.
A BRIN does not keep an index entry for each tuple; instead, it stores the minimum and maximum values for each range of 128 blocks (the default), which corresponds to 1 MB of data. The index is therefore very small but lossy.
Scanning the index will return more data than we asked for. PostgreSQL has to filter out these additional rows in a later step.
The following example demonstrates how small a BRIN really is:
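A sketch of such a comparison on the t_test table, again with made-up index names; the exact sizes you see will depend on your PostgreSQL version:

```sql
-- Build a B-tree and a BRIN over the same column
-- and compare their on-disk size:
CREATE INDEX idx_btree_cmp ON t_test USING btree (id);
CREATE INDEX idx_brin_cmp  ON t_test USING brin  (id);

SELECT relname, pg_size_pretty(pg_relation_size(oid))
FROM   pg_class
WHERE  relname IN ('idx_btree_cmp', 'idx_brin_cmp');
```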
In my example, the BRIN index is 2,000 times smaller than a standard B-tree. The question naturally arising now is, why don't we always use BRIN indexes?
To answer this question, it is important to reflect on the layout of BRIN; the minimum and maximum values for 1 MB are stored. If the data is sorted (high correlation), BRIN is pretty efficient because we can fetch 1 MB of data and scan it, and we are done.
However, what if the data is shuffled? In this case, BRIN won't be able to exclude chunks of data anymore because it is very likely that something close to the overall high and the overall low is within that 1 MB of data.
Therefore, BRIN is mostly made for highly correlated data. In reality, correlated data is quite likely in data warehousing applications. Often, data is loaded every day, and therefore dates can be highly correlated.
BRIN supports the same strategies as a B-tree and therefore needs the same set of operators. The code can be reused nicely:
Operation             | Strategy number
----------------------+----------------
Less than             | 1
Less than or equal    | 2
Equal                 | 3
Greater than or equal | 4
Greater than          | 5
The support functions required by BRIN's min/max operator classes are opcInfo, add_value, consistent, and union.
Since PostgreSQL 9.6, there has been an easy way to deploy entirely new index types as extensions. This is pretty cool because if those index types provided by PostgreSQL are not enough, it is possible to add additional ones that serve your purpose. The instruction to do this is CREATE ACCESS METHOD:
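The generic syntax looks like this. The access method and handler names here are hypothetical; in practice, a real extension ships its own handler function:

```sql
-- Register a new index access method backed by a handler
-- function provided by an extension:
CREATE ACCESS METHOD myindex
    TYPE INDEX
    HANDLER myindex_handler;
```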
Don't worry too much about this command — if you ever deploy your own index type, it will come as a ready-to-use extension.
One of these extensions implements bloom filters, which are probabilistic data structures. They sometimes return too many rows, but never too few. Therefore, a bloom filter is a good way to pre-filter data.
How does it work? A bloom filter is defined on a couple of columns. A bitmask is calculated based on the input values, which is then compared to your query. The upside of a bloom filter is that you can index as many columns as you want. The downside is that the entire bloom filter has to be read. Of course, the bloom filter is smaller than the underlying data and so it is, in many cases, very beneficial.
To use bloom filters, just activate the extension, which is part of the PostgreSQL contrib package:
test=# CREATE EXTENSION bloom;
CREATE EXTENSION
As we stated previously, the idea behind a bloom filter is that it allows you to index as many columns as you want. In many real-world applications, the challenge is to index many columns without knowing which combinations the user will actually need at runtime. In the case of a large table, it is totally impossible to create standard B-tree indexes on, say, 80 fields or more. A bloom filter might be an alternative in this case:
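A sketch of how this looks in practice, assuming the bloom extension has been activated; the table, columns, and index name are made up for illustration:

```sql
-- A table with several equally plausible filter columns:
CREATE TABLE t_bloom (
    id   serial,
    col1 int,
    col2 int,
    col3 int,
    col4 int
);
INSERT INTO t_bloom (col1, col2, col3, col4)
SELECT (random() * 1000)::int,
       (random() * 1000)::int,
       (random() * 1000)::int,
       (random() * 1000)::int
FROM   generate_series(1, 1000000);

-- One bloom index covers all columns at once; any combination
-- of equality conditions on them can use it:
CREATE INDEX idx_bloom
    ON t_bloom USING bloom (col1, col2, col3, col4);

EXPLAIN SELECT * FROM t_bloom
WHERE  col3 = 10 AND col1 = 20;
```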
Note that I have queried a combination of random columns; they are not related to the actual order in the index. The bloom filter will still be beneficial.
In this brief guide we’ve seen that PostgreSQL supports a variety of index types beyond the well-known B-tree, each of which lends itself to a particular set of circumstances or use case. Knowing which sort of index to use will help you create PostgreSQL-based solutions that are optimally performant with your data and for the problems you want to solve.
Further reading
Indexing and Hashing
PostgreSQL documentation: 11.2. Index Types
Climbing B-tree Indexes in Postgres