Strategies for Database Size Reduction in PostgreSQL: A Usage-Based Approachby@timescale
9,085 reads
9,085 reads

Strategies for Database Size Reduction in PostgreSQL: A Usage-Based Approach

by TimescaleNovember 10th, 2023
Read on Terminal Reader
Read this story w/o Javascript

Too Long; Didn't Read

Explore essential strategies for PostgreSQL database optimization in a usage-based pricing model. Learn how to reduce storage costs, improve performance, and embrace a culture of continuous improvement.

People Mentioned

Mention Thumbnail
featured image - Strategies for Database Size Reduction in PostgreSQL: A Usage-Based Approach
Timescale HackerNoon profile picture

Database pricing models are hard. As a developer looking for a managed database, one of the most annoying (and yet crucial) aspects of the search process involves evaluating not only the upfront price of the solution for your database size but also how the pricing works and how well it will scale.

On top of the nuances when evaluating database pricing (e.g., “How much does the bill increase as my data grows?”, “Am I being charged per writes or reads?”, and “Do I have to pay more per backups?”), developers tend to overlook one aspect: the way a database's pricing model is structured will influence how you manage your data and assess your Postgres database size.

Different pricing models require different approaches to running PostgreSQL. For example, if you’re locked into a large disk, you might not prioritize reducing your database size. If you’re charged per data read, you might think twice about running certain queries, and if you’re charged per data ingress, you may be cautious about the frequency and volume of newly ingested data.

Each pricing element subtly influences the strategies and behaviors you’ll end up adopting, pushing you toward a particular way of managing and interacting with your database to ensure both cost-efficiency and optimal performance.

At Timescale, we transitioned to a usage-based storage model a few months ago, with great customer feedback. In this blog post, we’ll explore the operational advantages we’ve observed from our customers since transitioning to this model, together with tactics on how to optimally manage your PostgreSQL storage in a usage-based model.

Usage-based storage models are becoming increasingly popular in the database world—who wants to pay for disk space they’re not using? Still, a usage-based model doesn’t come without consequences, and you need to consider some best practices to navigate it effectively.

Quick Recap on PostgreSQL Storage Models

To lay out a common ground for our discussion on managing your database size in a usage-based model, let’s start by quickly covering how pricing works in our PostgreSQL platform (Timescale) and how it compares to other managed PostgreSQL products, like Amazon RDS for PostgreSQL and Amazon Aurora.

Starting yesterday (💥), we offer two types of database services in the Timescale platform:

  • Dynamic PostgreSQL services, where developers can create PostgreSQL databases with dynamic compute to save money without sacrificing performance.

  • Time Series services, where developers can create PostgreSQL databases boosted with extra performance and scalability via hypertables, columnar compression, continuous aggregates, and tiered storage.

Let’s focus on how we price storage on our platform. Both these services come with a usage-based model for storage, which means the following:

  • Developers are charged for the volume they use, without small print—no minimum database size, no minimum scaling steps. If, by the end of the month, you’ve used 128 GB, that’s what you’ll be billed for. You can start at 1 GB and grow to TBs.

  • There are no charges based on data written, data transfers, or queries run.

  • No need to allocate storage when creating a database or scaling.

  • No need to manually downsize disks.

To bring this home, let’s lay out the differences between this model, Amazon RDS PostgreSQL, and Amazon Aurora:

In summary, here are the three main takeaways from our comparison:

  • Both Timescale and Aurora charge for your actual database size. In contrast, RDS PostgreSQL charges for provisioned storage. You can’t downsize volumes in RDS.

  • Timescale does not charge for data written, data transfers, or query operations. Both RDS and Aurora charge per data transfer, extra backup storage, and you might incur some extra I/O charges, depending on which type of storage you’re picking.

  • Timescale has no minimum scaling steps, whereas Aurora scales in 10 GB increments after starting with 10 GB.

As you can see, Timescale’s model is closer to Aurora I/O-Optimized, with the difference that Timescale lacks scaling steps and extra charges for things like data transfer. In contrast, RDS is a good representation of the allocation-based model, even if RDS lacks the “storage tiers” traditionally found in database vendors operating on this model.

How Usage-Based Pricing Improves PostgreSQL Storage Management

As we previously introduced, different pricing models imply different database experiences. When we transitioned from an allocation-based to a usage-based model, our customers told us they saw immediate improvements in three operational areas:

  • They didn’t need to provision storage upfront when starting a database, which led to fewer overprovisioning mistakes and, therefore, lower storage bills.
  • They didn’t have to think about storage while scaling up.
  • They could downscale—e.g., if they deleted data, they stopped paying for it.

Usage-based models eliminate the problem of storage overprovisioning

In traditional allocation-based models, developers often find themselves predicting their storage needs, which, very often, leads to storage overprovisioning. We observed this across our fleet when Timescale operated on a usage-based model: the majority of our customers were not using their full disk capacity, which means they were essentially paying for storage space they weren't yet using. Usage-based models eliminate this guessing game (and the consequences of wrong guesswork).

You don’t need to think about storage while scaling up

“Timescale's usage-based storage means I don't have to think about disk size anymore. Our storage cost is down 30 %, and I didn't have to do anything!"

Adam McCrea, Judoscale (Timescale customer)

In usage-based models, the storage scales seamlessly as your database grows. A main source of stress in traditional allocation-based models is the danger of running out of disk space, which can lead to numerous problems ranging from application downtime and lost transactions to data corruption in the worst-case scenarios.

With usage-based models, developers no longer have to vigilantly monitor their storage to avoid hitting a storage wall. At the same time, they don’t have to worry about heavy autoscaling steps or tier jumps.

You can downscale (stop paying for data you deleted)

Last but not least, usage-based models allow you to “downscale.” If you delete data (or manage to reduce your database size considerably), you start paying less per storage, which sounds only fair. As we’ll see in the next section, Timescale has a few features to help customers reduce their PostgreSQL database size. Adopting a usage-based model allowed our customers to immediately realize savings when reducing disk usage, which incentivized them to keep their database lean.

How to Effectively Navigate a Usage-Based Model: Tips to Reduce Your Postgres Database Size

The benefits we just mentioned improve the developer experience of working with storage significantly, which is why usage-based models are becoming very popular. But usage-based pricing comes with an obvious (yet profound) consequence: it forces developers to adopt good data practices to reduce their PostgreSQL database size as much as possible.

When you know that your storage costs are directly linked to the disk size you’re actually using, there’s a newfound incentive to be more judicious with data storage. If you’re operating in a usage-based model for storage, it becomes your responsibility to ensure you’re storing data efficiently.

In a way, this can be considered a “drawback” of usage-based models, and it requires some work—but this is actually a blessing in disguise. In traditional allocation-based models, data hygiene can be somewhat overlooked because the costs are fixed: if you’re locked into a 500 GB disk in RDS, and your database is 200 GB, you don’t seem to have a big incentive to make storage usage efficient.

But here’s the thing: good data practices are not just about saving money. To scale a PostgreSQL database, it is essential to keep it optimized. By adopting good PostgreSQL data management practices, you will not only see better performance but your life as a database administrator will get much easier.

With this in mind, let’s run through some practices that will help you navigate a usage-based model for storage as effectively as possible, reducing your PostgreSQL database size in a systematic way. In the particular case of Timescale, we have some particularly helpful features, which we’ll cover as well.

Reduce bloat in your PostgreSQL database

A first must-do in a usage-based model is to pay attention to the specifics of how PostgreSQL storage works, i.e., you must reduce bloat on a regular basis. This will help you not only with your database size but also with your I/O requirements. We’ll point you toward some basics in this section, but this HN thread has some great advice, and we’ve also written a blog post on table bloat in PostgreSQL.

  • Monitor dead tuples. A proactive approach to optimizing PostgreSQL storage begins with keeping a close eye on dead tuples. Dead tuples, if left unchecked, can accumulate and lead to unnecessary storage overheads. The pgstattuple() extension can be a great ally to understanding how tables manage these tuples.

  • Vacuum frequently. To effectively combat table bloat, you might want to configure autovacuum to run more frequently. Rather than globally adjusting autovacuum settings in postgresql.conf, it's more precise to fine-tune these settings on a per-table basis. This caters to the fact that different tables can have varying tendencies for accumulating dead tuples. It's also crucial to monitor long-running transactions that might hinder autovacuum operations.

  • Reclaim unused pages. While autovacuum and vacuum address dead tuples, reclaiming unused pages demands a different approach. Although VACUUM FULL can be utilized for this purpose, it has the inherent limitation of locking the entire table. Pg_repack can help you with this.

Do some PostgreSQL fine-tuning

Systematically reducing your PostgreSQL database size is closely related to being able to scale your PostgreSQL database effectively, i.e., keeping things fast and agile while you’re adding more and more data. Keeping an eye on (and perhaps adjusting) some key PostgreSQL parameters can help. This article reviews the most important performance parameters. Here are some aspects you need to consider:

  • shared_buffers: controls the memory used for PostgreSQL’s page cache, and it’s typically set to 25 % of the system’s total RAM.

  • work_mem: it sets the memory allocated per operation within a query. In Timescale, the recommended setting is (25 % of RAM) / max_connections.

  • max_connections: it sets the max number of concurrent connections allowed to the database. Connection poolers can help manage many short-lived connections.

  • max_worker_processes: it determines the max number of worker processes PostgreSQL can initiate. If using Timescale, the formula to set this parameter is: max_worker_processes = 3 + timescaledb.max_background_workers + max_parallel_workers.

  • max_parallel_workers: it specifies the max number of workers supporting parallel queries. The default is to set this equal to the number of available CPUs.

  • autovacuum_max_workers: it determines the max number of concurrent autovacuum processes. In Timescale, it is set to 10 by default.

Practice indexing hygiene

Regularly optimizing indexes will help keep your PostgreSQL efficient, especially as the database scales. While indexes can help you improve query performance when used wisely, excessive indexes can create issues in large PostgreSQL deployments.

The most obvious consequence of excessive indexing is increased storage consumption, as every index necessitates separate storage, which grows proportionally with the size of the tables. This overhead can become more significant when tables are partitioned, like in Timescale’s hypertables.

Unnecessary indexes can also be counterproductive in improving your write operations, as each new data entry or modification implies concurrent index updates, leading to more I/O operations and potential table bloat. Monitoring your indexes will help you identify which ones are no longer being used, keeping things lean.

One way to do this is by using pg_stat_user_indexes:

-- Retrieve indexes that might not be used or are infrequently used
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS times_used
    idx_scan = 0
    OR idx_scan < some_low_threshold -- replace 'some_low_threshold' with a value that makes sense for your context
    idx_scan ASC, schemaname, relname;

If an index has a value of 0 in the idx_scan column, it means it hasn't been used since the last time the statistics were reset, meaning that it could be a candidate for optimization. By setting a low threshold for idx_scan, you can also identify infrequently used indexes.

Shrink your large tables via Timescale compression

One of Timescale's standout features is its native support for columnar compression, which can drastically reduce the disk space used by large tables without compromising query performance. Compression improves the performance of many queries.

Timescale's compression works by converting regular row-based data into a more compact columnar format. This process is particularly effective for time-series data, where many columns may contain repetitive values; we can achieve impressive compression rates (+90 %) by applying different compression algorithms depending on each data type. This means that your large tables can be shrunk up to 10x.

In Timescale, compression is enabled on a table-by-table basis via a time-based compression policy. For example, this code enables compression in a particular hypertable and automatically compresses data older than two weeks:

-- Enable compression on the hypertable
ALTER TABLE your_hypertable SET (
  timescaledb.compress_segmentby = 'some_column'
-- Set a compression policy to compress data older than two weeks
SELECT add_compression_policy('your_hypertable', INTERVAL '2 weeks');

To learn more about compression, check out our documentation.

Set up a data management strategy: Move older data to cheaper storage, delete the data you no longer need

The previous tactics are very helpful in reducing your database size, but there is another avenue you can leverage in Timescale to scale your storage effectively: tiered storage.

By creating a simple tiering policy, you can move older, less-accessed data to a bottomless object storage layer:

-- This policy moves all data older than one month to object storage 
SELECT add_tiering_policy('example', INTERVAL '1 month);

This object store has the following characteristics:

  • It comes with a lower price point than our high-performance storage, allowing you to store many TBs of data for much less.
  • It is infinite, meaning you can store as much data as you’d like.

This tiered storage architecture is ingrained in Timescale’s backend. The object storage is not a “bucket” detached from your database—instead, it’s an integral part of it. Queries will transparently access both storage layers without any action needed from you—you can just keep using standard SQL over the same schema. Once your tiering policy is set, there’s nothing else to consider!

In Timescale, you can tier your less-accessed data to a low-cost object storage layer, leaving your data accessible to ad-hoc queries but paying much less

We recommend moving data to the low-cost storage layer once it’s not being frequently accessed by your application since there’s a performance cost (the object store is not as fast as our regular storage). But you can keep running ad-hoc queries over this data comfortably (e.g., queries that are not critical for the user experience of your customers and for which top performance is not essential).

Editor’s note: We’ll share exciting news on tiered storage soon. 🎉 Stay tuned!

On top of offering this low-cost storage layer, Timescale makes it easy to set up data retention policies to delete data you no longer need:

-- Set a data retention policy to drop data older than 10 months
SELECT add_retention_policy('your_hypertable', INTERVAL '10 months');

You can combine data retention policies like the one above with continuous aggregates to effectively downsample your dataset—e.g., reducing granularity from one second to one minute, deleting the one-second values but keeping the one-minute aggregates. Read this blog post to see an example of how to do this and proactively manage long-term data.

Usage-Based Models and the Data Management Paradigm

While usage-based models may seem nothing but a pricing change on the surface, they bring about a paradigm shift in how developers think about their database size and how they perceive and handle data.

Usage-based models promote a culture of continuous improvement, where the focus shifts from mere storage management to database health and efficiency. This requires some discipline at first, but once your mindset shifts and you learn some techniques, you’ll be in the best place to scale your PostgreSQL database efficiently and effectively.

Timescale has valuable tools to help you systematically reduce your PostgreSQL database size, like compression, tiered storage, and data retention policies. This allows you to effectively scale your PostgreSQL databases in a usage-based model. To experience it yourself, sign up for Timescale—you can use it for free for the first 30 days.

- Written by Carlota Sota.

Also published here.