Lately, one of my favorite database features is the JSONB data type in PostgreSQL. To recap the SQL vs NoSQL debate:
With JSONB data in PostgreSQL, you can get the best of both worlds. We know that a rigid schema is not a good fit for every type of data, because not all real-world data adheres to a strict format.
For example, log messages, errors, code traces, and JSON web services all return diverse families of data. And, there are plenty of good use cases for storing this kind of data in the database. But it's also true that relational technology is extremely powerful, and it it would be a shame to have to give it all up to just store a bit of unstructured data.
In this post, I’m not just recommending that you check out JSONB; you’ve probably already done that. I’m going to suggest two techniques that we use at AppLand to make JSONB data even more effective. Let's dig in.
You know that when you have SQL data stored in strongly typed columns, every row has a uniform structure, and data is both efficient to retrieve and easy to to index. What you may not know is that PostgreSQL can also create indexes on JSONB columns. So, if you have a common field such as name
within a JSONB document, you can add an index on that field. Then you can efficiently search by name
within the JSONB, and PostgreSQL won’t have to slog through all that JSON to find the matching names. Not only that, but you can also SELECT the name
from within the JSON and PostgreSQL can retrieve that data without parsing the JSON, using a technique called index-only scan.
Now, what if you want to be able to efficiently fetch a subset of JSONB data, without having to load and parse every object? At AppLand, we store AppMap data in the database. Each AppMap has a mandatory section called metadata
, and we want to be able to efficiently pull out the metadata of selected rows without having to parse the entire JSONB column (the metadata is usually less than 1% of the total JSON object). Awesomely, since PostgreSQL 12, PostgreSQL has a feature called generated columns.
If you’ve ever worked with triggers, you know that when a row is inserted, updated, or deleted, you can instruct the database to run an in-database function which can process the row in a pretty much unlimited way. A generated column is like a trigger; it runs automatically whenever a row is inserted or updated. The generated column specifically applies a function to the new row, and then stores the result of that function on the row.
So, you can use generated functions to pull frequently used data out of a JSONB column, and store it on the row. Then you can write a query to access the copied/cached (“de-normalized”) data. It’s truly the best of both relational and non-relational worlds.
Here's a concrete example of a JSONB index and generated column. For this example, I will use AppMap JSON, which represents a class dependency diagram and execution trace of a program. At the top level, the AppMap format includes metadata
, classMap
, and events
.
Let's store the AppMap data in a simple table called appmaps
, and add a minimal record:
CREATE TABLE appmaps (
id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
data jsonb
);
-- data is schema-less so we can omit all fields except metadata
INSERT INTO appmaps ( data )
VALUES ( '{"metadata":{"name":"login"}}'::jsonb );
Optimizing with an index
When a user wants to retrieve an AppMap by name, the server needs to search data->'metadata'->>'name'
. The double-arrow ->>
instructs PostgreSQL to convert the result to text
(from JSON), so that the query condition can be ordinary text.
CREATE INDEX idx_appmaps_name
ON appmaps USING BTREE ((data->'metadata'->>'name'));
Note: There are different types of indexes you can use, with advantages and disadvantages. BTREE
is good for text.
This entire query can now be executed using just the index, using index-only scan:
SELECT data->'metadata'->>'name' AS name
FROM appmaps WHERE data->'metadata'->>'name' = 'login';
name
-------
login
(1 row)
Optimizing with a generated column
Let's say we want to be able to efficiently pull out the metadata
from the data column. Add the GENERATED
column:
ALTER TABLE appmaps ADD COLUMN metadata JSONB
GENERATED ALWAYS AS (data->'metadata') STORED;
All rows now contain the metadata
column which holds stored, denormalized data:
SELECT id, metadata from appmaps;
id | metadata
--------------------------------------+-------
c5e4cfa5-1945-4685-a50f-7e5b0edd62e5 | {"name": "login"}
(1 row)
These are a couple of techniques that we are using at AppLand to make JSON data act "relational" when we want it to. Not only does PostgreSQL provide the powerful JSONB data type, but it also provides the shortcuts and usability tricks that make JSONB truly useful.
For those who are ramping up on some new-to-you backend code, and finding it hard to figure out how all the pieces fit together - see what the code really does with AppMap - Google Maps for code.
Also published on https://dev.to/kgilpin/effective-sql-indexing-and-denormalizing-jsonb-1bhm.