Thinking about scaling beyond your Postgres cluster and adding another data store like Redis or Elasticsearch? Before adopting a more complex infrastructure, take a minute and think again. It’s quite possible to get more out of an existing Postgres database. It can scale for heavy loads and offers powerful features which are not obvious at first sight. For example, its possible to enable in-memory caching, text search, specialized indexing, and key-value storage.
After reading this article, you may want to list down the features you want from your data store and check if Postgres will be a good fit for them. It’s powerful enough for most applications.
As Fred Brooks put it in The Mythical Man-Month: "The programmer, like the poet, works only slightly removed from pure thought-stuff. [They] build castles in the air, from air, creating by exertion of the imagination."
Adding more pieces to those castles, and getting lost in the design, is endlessly fascinating; however, in the real world, building more castles in the air can get in your way. The same holds true for the latest hype in data stores. There are several advantages to choosing boring technology:
Although it can be managed by thoughtful design, adding multiple datastores does increase complexity. Before exploring adding additional datastores, it's worth investigating what additional features your existing datastores can offer you.
Many people are unaware that Postgres offers way more than just a SQL database. If you already have Postgres in your stack, why add more pieces when Postgres can do the job?
Postgres caches, too
There’s a misconception that Postgres reads and writes from disk on every query, especially when users compare it with purely in-memory data stores like Redis.
Actually, Postgres has a beautifully designed caching system with pages, usage counts, and transaction logs. Most of your queries will not need to access the disk, especially if they refer to the same data over and over again, as many queries tend to do.
The shared_buffer configuration parameter in the Postgres configuration file determines how much memory it will use for caching data. Typically it should be set to 25% to 40% of the total memory. That’s because Postgres also uses the operating system cache for its operation. With more memory, most recurring queries referring the same data set will not need to access the disk. Here is how you can set this parameter in the Postgres CLI:
ALTER SYSTEM SET shared_buffer TO = <value>
Managed database services like Heroku offer several plans where RAM (and hence cache) is a major differentiator. The free hobby version does not offer dedicated resources like RAM. Upgrade when you’re ready for production loads so you can make better use of caching.
You can also use some of the more advanced caching tools. For example, check the pg_buffercache view to see what’s occupying the shared buffer cache of your instance. Another tool to use is the pg_prewarm function which comes as part of the base installation. This function enables DBAs to load table data into either the operating system cache or the Postgres buffer cache. The process can be manual or automated. If you know the nature of your database queries, this can greatly improve application performance.
For the really brave at heart, refer to this article for an in-depth description of Postgres caching.
Text searching
Elasticsearch is excellent, but many use cases can get along just fine with Postgres for text searching. Postgres has a special data type, tsvector, and a set of functions, like to_tsvector and to_tsquery, to search quickly through text. tsvector represents a document optimized for text search by sorting terms and normalizing variants. Here is an example of the to_tsquery function:
SELECT to_tsquery('english', 'The & Boys & Girls');
to_tsquery
---------------
'boy' & 'girl'
You can sort your results by relevance depending on how often and which fields your query appeared in the results. For example, you can make the title more relevant than the body. Check the Postgres documentation for details.
Functions in Postgres
Postgres provides a powerful server-side function environment in multiple programming languages.
Try to pre-process as much data as you can on the Postgres server with server-side functions. That way, you can cut down on the latency that comes from passing too much data back and forth between your application servers and your database. This approach is particularly useful for large aggregations and joins.
What’s even better is your development team can use its existing skill set for writing Postgres code. Other than the default PL/pgSQL (Postgres’ native procedural language), Postgres functions and triggers can be written in PL/Python, PL/Perl, PL/V8 (JavaScript extension for Postgres) and PL/R.
Here is an example of creating a PL/Python function for checking string lengths:
CREATE FUNCTION longer_string_length (string1 string, string2 string)
RETURNS integer
AS $$
a=len(string1)
b-len(string2)
if a > b:
return a
return b
$$ LANGUAGE plpythonu;
Extensions are to Postgres are what plug-ins mean in many applications. Suitable use of Postgres extensions can also mean you don’t have to work with other data stores for extra functionality. There are many extensions available and listed on the main Postgres website.
Geospatial Data
PostGIS is a specialized extension for Postgres used for geospatial data manipulation and running location queries in SQL. It’s widely popular among GIS application developers who use Postgres. A great beginner’s guide to using PostGIS can be found here.
The code snippet below shows how we are adding the PostGIS extension to the current database. From the OS, we run these commands to install the package (assuming you are using Ubuntu):
$ sudo add-apt-repository ppa:ubuntugis/ppa
$ sudo apt-get update
$ sudo apt-get install postgis
After that, log in to your Postgres instance and install the extension:
CREATE EXTENSION postgis;
CREATE EXTENSION postgis_topology;
If you want to check what extensions you have in the current database, run this command:
SELECT * FROM pg_available_extensions;
Key-Value Data Type
The Postgres hstore extension allows storing and searching simple key-value pairs. This tutorial provides an excellent overview of how to work with hstore data type.
Semi-structured Data Types
There are two native data types for storing semi-structured data in Postgres: JSON and XML. The JSON data type can host both native JSON and its binary form (JSONB). The latter can significantly improve query performance when it is searched. As you can see below, it can convert JSON strings to native JSON objects:
SELECT '{"product1": ["blue", "green"], "tags": {"price": 10, "discounted": false}}'::json;
json
---------------------------------------------------------------------
{"product1": ["blue", "green"], "tags": {"price": 10, "discounted": false}}
If you’re considering switching off Postgres due to performance reasons, first see how far you can get with the optimizations it offers. Here we'll assume you've done the basics, like creating appropriate indexes. Postgres offers many advanced features, and while the changes are small they can make a big difference, especially if it keeps you from complicating your infrastructure.
Don’t over-index
Avoid unnecessary indexes. Use multi-column indexes sparingly. Too many indexes take up extra memory that crowd out better uses of the Postgres cache, which is crucial for performance.
Using a tool like EXPLAIN ANALYZE might surprise you by how often the query planer actually chooses sequential table scans. Since much of your table’s row data is already cached, oftentimes these elaborate indexes aren’t even used.
That said, if you do find slow queries, the first and most obvious solution is to see if the table is missing an index. Indexes are vital, but you have to use them correctly.
Partial indexes save space
A partial index can save space by specifying which values get indexed. For example, you want to order by a user’s signup date, but only care about the users who have signed up:
CREATE INDEX user_signup_date ON users(signup_date) WHERE is_signed_up;
Choosing the right index for your data can improve performance. Here are some common index types and when you should use each one.
B-tree indexes
B-tree indexes are binary trees that are used to sort data efficiently. They’re the default if you use the INDEX command. Most of the time, a B-tree index suffices. As you scale, inconsistencies can be a larger problem, so use the amcheck extension periodically.
BRIN indexes
A Block Range INdex (BRIN) can be used when your table is naturally already sorted by a column, and you need to sort by that column. For example, for a log table that was written sequentially, setting a BRIN index on the timestamp column lets the server know that the data is already sorted.
Bloom filter index
A bloom index is perfect for multi-column queries on big tables where you only need to test for equality. It uses a special mathematical structure called a bloom filter that’s based on probability and uses significantly less space.
CREATE INDEX i ON t USING bloom(col1, col2, col3);
SELECT * from t WHERE col1 = 5 AND col2 = 9 AND col3 = ‘x’;
GIN and GiST indexes
Use a GIN or GiST index for efficient indexes based on composite values like text, arrays, and JSON.
There are legitimate cases for adding another datastore beyond Postgres.
Special data types
Some data stores give you data types that you just can’t get on Postgres. For example, the linked list, bitmaps, and HyperLogLog functions in Redis are not available on Postgres.
At a previous startup, we had to implement a frequency cap, which is a counter for unique users on a website based on session data (like cookies). There might be millions or tens of millions of users visiting a website. Frequency capping means you only show each user your ad once per day.
Redis has a HyperLogLog data type that is perfect for a frequency cap. It approximates set membership with a very small error rate, in exchange for O(1) time and a very small memory footprint. PFADD adds an element to a HyperLogLog set. It returns 1 if your element is not in the set already, and 0 if it is in the set.
PFADD user_ids uid1
(integer) 1
PFADD user_ids uid2
(integer) 1
PFADD user_ids uid1
(integer)
Heavy real-time processing
If you’re in a situation with many pub-sub events, jobs, and dozens of workers to coordinate, you may need a more specialized solution like Apache Kafka. LinkedIn engineers originally developed Kafka to handle new user events like clicks, invitations, and messages, and allow different workers to handle message passing and jobs to process the data.
Instant full-text searching
If you have a real-time application under heavy load with more than ten searches going on at a time, and you need features like autocomplete, then you may benefit more from a specialized text solution like Elasticsearch.
Redis, Elasticsearch, and Kafka are powerful, but sometimes adding them does more harm than good. You may be able to get the capabilities you need with Postgres by taking advantage of the lesser-known features we’ve covered here. Ensuring that you are getting the most out of Postgres can save you time and help you avoid added complexity and risks.
To save even more time and headaches, consider using a managed service like Heroku Postgres. Scaling up is a simple matter of adding additional follower replicas, high availability can be turned on with a single click, and Heroku operates it for you. If you really need to expand beyond Postgres, the other data stores that we mentioned above, such as Redis, Apache Kafka and Elasticsearch, can all be easily provisioned on Heroku. Go ahead and build your castles in the air―but anchor them to a reliable foundation, so you can dream about a better product and customer experience.