paint-brush
Postgres TOAST: Understanding the Data Compression Mechanism and Its Limitations by@timescale
7,544 reads
7,544 reads

Postgres TOAST: Understanding the Data Compression Mechanism and Its Limitations

by TimescaleNovember 3rd, 2023
Read on Terminal Reader
Read this story w/o Javascript

Too Long; Didn't Read

This article discusses the limitations of PostgreSQL's TOAST mechanism for data compression, which was originally designed to manage large values within a PostgreSQL page. While TOAST has its merits, it falls short in terms of optimizing storage efficiency for modern applications with large datasets. The article introduces TimescaleDB's columnar compression functionality as a solution to significantly reduce database size while improving query performance. This approach allows for more efficient storage, making PostgreSQL a more formidable choice for data management in contemporary applications.
featured image - Postgres TOAST: Understanding the Data Compression Mechanism and Its Limitations
Timescale HackerNoon profile picture


If you’re working with large databases in Postgres, this story will sound familiar. As your Postgres database keeps growing, your performance starts to decline, and you begin to worry about storage space—or, to be precise, how much you’ll pay for it. You love PostgreSQL, but there’s something you wish you had: a highly effective data compression mechanism.


PostgreSQL does have somewhat of a compression mechanism: TOAST 🍞. In this post, we’ll walk you through how Postgres TOAST works and the different TOASTing strategies. As much as we enjoy a good TOAST, we’ll discuss why this is not the kind of compression feature you need for reducing the storage footprint of modern large databases—and how, as the PostgreSQL enthusiasts that we are here at Timescale, we decided to build a more suitable compression mechanism for PostgreSQL, inspired by the columnar design of NoSQL databases.


What Is Postgres TOAST?

Even if it might reduce the size of datasets, TOAST (The Oversized Attribute Storage Technique) is not your traditional data compression mechanism. To understand what TOAST is, we have to start by talking about how data is stored in PostgreSQL.


Postgres’ units of storage are called pages, and these have a fixed size (8 kB by default). Having a fixed page size gives Postgres many advantages, namely its data management simplicity, efficiency, and consistency, but it comes with a downside: some data values might not fit within that page.


This is where TOAST comes in. TOAST refers to the automatic mechanism that PostgreSQL uses to efficiently store and manage values in Postgres that do not fit within a page. To handle such values, Postgres TOAST will, by default, compress them using an internal algorithm. If, after compression, the values are still too large, Postgres will move them to a separate table (called the TOAST table), leaving pointers in the original table.


As we’ll see later in this article, you can actually modify this strategy as a user, for example, by telling Postgres to avoid compressing data in a particular column.


TOAST-able Data Types

The data types that might be subjected to TOAST are primarily variable-length ones that have the potential to exceed the size limits of a standard PostgreSQL page. On the other hand, fixed-length data types, like integerfloat, or  timestamp, are not subjected to TOAST since they fit comfortably within a page.


Some examples of data types that might be subjected to TOAST are:


  • json and jsonb

  • Large text strings

  • varchar and varchar(n) (If the length specified in varchar(n) is small enough, then values of that column might always stay below the TOAST threshold.)

  • bytea storing binary data

  • Geometric data like path and polygon and PostGIS types like  geometry or geography


How Does Postgres TOAST Work?

Understanding TOAST not only relates to the concept of page size but also to another Postgres storage concept: tuples. Tuples are rows in a PostgreSQL table. Typically, the TOAST mechanism kicks in if all fields within a tuple have a total size of over 2 kB approx.


If you’ve been paying attention, you might be wondering, “Wait, but the page size is around 8 kB—why this overhead?” That’s because PostgreSQL likes to ensure it can store multiple tuples on a single page: if tuples are too large, fewer tuples fit on each page, leading to increased I/O operations and reduced performance.


Postgres also needs to keep free space to fit additional operational data: each page stores not just the tuple data but also additional information for managing the data, such as item identifiers, headers, and transaction information.


So, when the combined size of all fields in a tuple exceeds approximately 2 kB (or the TOAST threshold parameter, as we’ll see later), PostgreSQL takes action to ensure that the data is stored efficiently. TOAST handles this in two primary ways:


  1. Compression. PostgreSQL can compress the large field values within the tuple to reduce their size using a compression algorithm that we’ll cover later in this article. By default, if compression is sufficient to bring the tuple's total size below the threshold, the data will remain in the main table, albeit in a compressed format.


  2. Out-of-line storage. If compression alone isn't effective enough to reduce the size of the large field values, Postgres moves them to a separate TOAST table. This process is known as "out-of-line" storage because the original tuple in the main table doesn’t hold the large field values anymore. Instead, it contains a "pointer" or reference to the location of the large data in the TOAST table.


We’re simplifying things slightly for this article—read the PostgreSQL documentation for a full detailed view.


The Postgres Compression Algorithm: pglz

We’ve mentioned that TOAST can compress large values in PostgreSQL. But which compression algorithm is PostgreSQL using, and how effective is it?


The pglz (PostgreSQL Lempel-Ziv) is the default internal compression algorithm used by PostgreSQL specifically tailored for TOAST.


Here’s how it works in very simple terms:


  • pglz tries to avoid repeated data. When it sees repeated data, instead of writing the same thing again, it just points back to where it wrote it before. This "avoiding repetition" helps in saving space.


  • As pglz reads through data, it remembers a bit of the recent data it has seen. This recent memory is referred to as the "sliding window."


  • As new data comes in, pglz checks if it has seen this data recently (within its sliding window). If yes, it writes a short reference instead of repeating the data.


  • If the data is new or not repeated enough times to make a reference shorter than the actual data, pglz just writes it down as it is.


  • When it's time to read the compressed data, pglz uses its references to fetch the original data. This process is quite direct, as it looks up the referred data and places it where it belongs.


  • pglz doesn't need separate storage for its memory (the sliding window); it builds it on the go while compressing and does the same when decompressing.


This implementation is designed to offer a balance between compression efficiency and speed within the TOAST mechanism. In terms of compression rate, the effectiveness of pglz will largely depend on the nature of the data.


For example, highly repetitive data will compress much better than high entropy data (like random data). You might see compression ratios in the range of 25 to 50 percent, but this is a very general estimate—results will vary widely based on the exact nature of the data.


Configuring TOAST

TOAST strategies

By default, PostgreSQL will go through the TOAST mechanism according to the procedure explained earlier (compression first and out-of-line storage next, if compression is not enough). Still, there might be scenarios where you might want to fine-tune this behavior on a per-column basis. PostgreSQL allows you to do this by using the TOAST strategies PLAINEXTERNALEXTENDED, and MAIN.


  • EXTENDED: This is the default strategy. It implies the data will be stored out of line in a separate TOAST table if it’s too large for a regular table page. Before moving the data to the TOAST table, it will be compressed to save space.


  • EXTERNAL: This strategy tells PostgreSQL to store the data for this column out of line if the data is too large to fit in a regular table page, and we’re asking PostgreSQL not to compress the data—the value will just be moved to the TOAST table as-is.


  • MAIN: This strategy is a middle ground. It tries to keep data in line in the main table through compression; if the data is definitely too large, it'll move the data to the TOAST table to avoid an error, but PostgreSQL won't move the compressed data. Instead, it’ll store the value in the TOAST table in its original form.


  • PLAIN: Using PLAIN in a column tells PostgreSQL to always store the column's data in line in the main table, ensuring it isn't moved to an out-of-line TOAST table. Take into account that if the data grows beyond the page size, the INSERT will fail because the data won’t fit.


If you want to inspect the current strategies of a particular table, you can run the following:


\d+ your_table_name 


You'll get an output like this:

=> \d+ example_table
                     Table "public.example_table"
 Column  |       Data Type   | Modifiers | Storage  | Stats target | Description 
---------+------------------+-----------+----------+--------------+-------------
  bar    | varchar(100000)  |           | extended |              | 


If you wish to modify the storage setting, you can do so using the following command:

-- Sets EXTENDED as the TOAST strategy for bar_column 
ALTER TABLE example_blob ALTER COLUMN bar_column SET STORAGE EXTENDED;

Key parameters

Apart from the strategies above, these two parameters are also important to control TOAST behavior:


TOAST_TUPLE_THRESHOLD


This is the parameter that sets the size threshold for when TOASTing operations (compression and out-of-line storage) are considered for oversized tuples.


As we’ve mentioned previously, by default, TOAST_TUPLE_THRESHOLD is set to approximately 2 kB.


TOAST_COMPRESSION_THRESHOLD


This is the parameter that specifies the minimum size of a value before Postgres considers compressing it during the TOASTing process.


If a value surpasses this threshold, PostgreSQL will attempt to compress it. However, just because a value is above the compression threshold, it doesn't automatically mean it will be compressed: the TOAST strategies will guide PostgreSQL on how to handle the data based on whether it was compressed and its resultant size relative to the tuple and page limits, as we’ll see in the next section.


Bringing it all together

TOAST_TUPLE_THRESHOLD is the trigger point. When the size of a tuple's data fields combined exceeds this threshold, PostgreSQL will evaluate how to manage it based on the set TOAST strategy for its columns, considering compression and out-of-line storage. The exact actions taken will also depend on whether column data surpasses the TOAST_COMPRESSION_THRESHOLD:


  • EXTENDED (default strategy): If a tuple's size exceeds TOAST_TUPLE_THRESHOLD, PostgreSQL will first attempt to compress the oversized columns if they also exceed TOAST_COMPRESSION_THRESHOLD. If compression brings the tuple size under the threshold, it will remain in the main table. If it doesn't, the data will be moved to an out-of-line TOAST table, and the main table will contain pointers to this external data.


  • MAIN: If the tuple size surpasses the TOAST_TUPLE_THRESHOLD, PostgreSQL will try to compress the oversized columns (provided they're over the TOAST_COMPRESSION_THRESHOLD). If compression allows the tuple to fit within the main table's tuple, it remains there. If not, the data is moved to the TOAST table in its uncompressed form.


  • EXTERNAL: PostgreSQL skips compression, regardless of the TOAST_COMPRESSION_THRESHOLD. If the tuple's size is beyond the TOAST_TUPLE_THRESHOLD, the oversized columns will be stored out-of-line in the TOAST table.


  • PLAIN: Data is always stored in the main table. If a tuple's size exceeds the page size (due to having very large columns), an error is raised.


Strategy

Compress if tuple > TOAST_COMPRESSION_THRESHOLD

Store out-of-line if tuple > TOAST_TUPLE_THRESHOLD

Description

EXTENDED

Yes

Yes

Default strategy. Compresses first, then checks if out-of-line storage is needed.

MAIN

Yes

Only in uncompressed form

Compresses first, and if still oversized, moves to TOAST table without compression.

EXTERNAL

No

Yes

Always moves to TOAST if oversized, without compression.

PLAIN

No

No

Data always stays in the main table. If a tuple exceeds the page size, an error occurs.


Why TOAST Isn't Enough as a Data Compression Mechanism in PostgreSQL

By now, you’ll probably understand why TOAST is not the data compression mechanism you wish you had in PostgreSQL. Modern applications imply large volumes of data ingested daily, meaning databases (over)grow quickly.


Such a problem was not as prominent when our beloved Postgres was built decades ago, but today’s developers need compression solutions for reducing the storage footprint of their datasets.


While TOAST incorporates compression as one of its techniques, it's crucial to understand that its primary role isn't to serve as a database compression mechanism in the traditional sense. TOAST is mainly a solution to one problem: managing large values within the structural confines of a Postgres page.


While this approach can lead to some storage space savings due to the compression of specific large values, its primary purpose is not to optimize storage space across the board.


For example, if you have a 5 TB database made up of small tuples, TOAST won’t help you turn those 5 TB into 1 TB. While there are parameters within TOAST that can be adjusted, this won't transform TOAST into a generalized storage-saving solution.


And there are other inherent problems with using TOAST as a traditional compression mechanism in PostgreSQL, for example:


  • Accessing TOASTed data can add overhead, especially when the data is stored out of line. This becomes more evident when many large text or other TOAST-able data types are frequently accessed.


  • TOAST lacks a high-level, user-friendly mechanism for dictating compression policies. It’s not built to optimize storage costs or facilitate storage management.


  • TOAST's compression is not designed to provide especially high compression ratios. It only uses one algorithm (pglz) with compression rates varying typically from 25-50 percent.

Adding Columnar Compression to PostgreSQL With Timescale

Via the TimescaleDB extension, PostgreSQL users have a better alternative. Inspired by the compression design of NoSQL databases, we added columnar compression functionality to PostgreSQL. This transformative approach transcends PostgreSQL’s conventional row-based storage paradigm, introducing the efficiency and performance of columnar storage.


By adding a compression policy to your large tables, you can reduce your PostgreSQL database size by up to 10x (achieving +90 percent compression rates).


By defining a time-based compression policy, you indicate when data should be compressed. For instance, you might choose to compress data older than seven (7) days automatically:


-- Compress data older than 7 days
SELECT add_compression_policy('my_hypertable', INTERVAL '7 days');


Via this compression policy, Timescale will transform the table partitions (which in Timescale are also created automatically) into a columnar format behind the scenes, combining many rows (1,000) into an array. To boost compressibility, Timescale will apply different compression algorithms depending on the data type:


  • Gorilla compression for floats

  • Delta-of-delta + Simple-8b with run-length encoding compression for timestamps and other integer-like types

  • Whole-row dictionary compression for columns with a few repeating values (+ LZ compression on top)

  • LZ-based array compression for all other types


This columnar compression design offers an efficient and scalable solution to the problem of large datasets in PostgreSQL. It allows you to use less storage to store more data without hurting your query performance (it improves it). And in the latest versions of TimescaleDB, you can also INSERTDELETE, and UPDATE directly over compressed data.

Wrap-Up

We hope this article helped you understand that while TOAST is a well-thought-out mechanism to manage large values within a PostgreSQL page, it’s not effective for optimizing database storage use within the realm of modern applications.


If you’re looking for effective data compression that can move the needle on your storage savings, give Timescale a go. You can try our cloud platform that propels PostgreSQL to new performance heights, making it faster and fiercer—it’s free, and no credit card is required—or you can add the TimescaleDB extension to your self-hosted PostgreSQL database.


Written by Carlota Soto.