Products like Amazon RDS for PostgreSQL are fine for smaller deployments, but scaling PostgreSQL is a different story. Once a project grows to many terabytes, these managed databases become slow and expensive, making data management much more complex. Performance suffers once tables reach billions of rows, and storage costs climb with every gigabyte you add.
But not anymore. Today, we’re happy to announce the General Availability of Tiered Storage, a multi-tiered storage architecture engineered to enable infinite, low-cost scalability for your time series and analytical databases in the Timescale platform. You can now store your older, infrequently accessed data in a low-cost storage tier while still being able to access it—without ever sacrificing performance for your frequently accessed data.
This is what happens when you insert data into a Timescale time-series database with our new multi-tiered storage backend:
Your most recent data is written into a high-performance storage tier optimized for fast queries and high ingest rates.
Once you access that data less frequently, you can automatically move it to a lower-cost object storage tier by setting up a tiering policy. The data in the low-cost storage tier remains fully queryable within your database, and there’s no limit to the amount of data you can store: hundreds of TBs and beyond. Our low-cost storage tier has a flat price of $0.021 per GB/month, cheaper than Amazon S3.
Developers need a cheap way to scale their large PostgreSQL databases in AWS without compromising performance. While object stores are amazingly scalable and affordable, they are not the fastest, and developers also need to get millisecond query responses for their applications.
However, once data becomes older and rarely accessed, real-time performance is often not as essential. Developers still need to be able to access this historical data for ad hoc queries, data analysis, or regulatory compliance, but they can assume some latency for these types of queries. Now, what developers want is the ability to store this historical data as affordably and efficiently as possible.
This new Tiered Storage architecture frees developers from choosing between storage costs and performance trade-offs for real-time applications. By keeping their recent and frequently accessed data in the high-performance tier, they’ll leverage the millisecond query speed and ingestion capabilities of Timescale, and by tiering their older data, they can keep as many TBs as they need in their PostgreSQL databases for less.
Timescale’s Tiered Storage architecture leverages the flexibility of PostgreSQL and the scalability of object storage. This storage architecture also eliminates any storage limitations in Timescale services: since our low-cost storage tier is bottomless, you can store as many terabytes as you’d like. For example, take our own Insights database, which constantly collects and analyzes query statistics from across our customer fleet: it has exceeded 350 TB today and is growing quickly. Of those 350 TB, 250 TB are tiered to low-cost storage.
Let’s do the math:
We store 5 TB in our high-performance storage tier. With compression enabled and compression ratios of 20x, what was originally 100 TB of Postgres data now fits on a 5 TB disk (!)
The remaining 250 TB of data are stored in the low-cost storage tier. This tiering happens automatically once data reaches a certain age, which is currently several weeks.
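The arithmetic behind these numbers is straightforward; here it is as a quick sketch, using the figures quoted above for our Insights database:

```python
# Back-of-the-envelope math for the Insights database example.
TOTAL_TB = 350          # total data collected so far
TIERED_TB = 250         # older data moved to low-cost object storage
COMPRESSION_RATIO = 20  # compression ratio observed on recent data

# Data that stays in the high-performance tier, pre-compression:
high_perf_tb = TOTAL_TB - TIERED_TB         # 100 TB of Postgres data

# Actual disk footprint after 20x compression:
disk_tb = high_perf_tb / COMPRESSION_RATIO  # fits on a 5 TB disk

print(disk_tb)  # 5.0
```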
Our customers with large deployments are also already making use of Tiered Storage:
"We perform a lot of analysis on market data, and the sheer volume of data we need to store makes a normal disk-based database solution unfeasible (it's just too expensive). Timescale’s Tiered Storage seamlessly allows us to move large volumes of data to the object storage layer. This is a great solution to store large volumes of historical data and perform post-analysis. Without this, we'd be forced to develop a solution in-house."
- Chief Technology Officer at a Proprietary Digital Assets Trading Company
Simplicity is the key feature of Tiered Storage. To move your data from the high-performance tier to the low-cost object tier, all you need to do is set up a tiering policy.
Editor's Note: Timescale hypertables are PostgreSQL tables automatically partitioned by time. Those partitions are called chunks. Chunks are the unit of user-defined policies in Timescale: for example, when you define a data retention policy, you’ll be deleting entire partitions (chunks), and when you move data between storage tiers, you’ll be moving entire chunks. This is a very convenient way to manage data from a resource-utilization and developer experience perspective.
For example, this policy would move all data older than one month in your `events` hypertable to object storage:

```sql
SELECT add_tiering_policy('events', INTERVAL '1 month');
```
That’s all you need! Everything else happens automatically, including intelligent query planning that only executes any SQL queries on the appropriate tier.
To drop data currently stored in the low-cost storage tier, you can define a data retention policy so data is deleted automatically after a certain period (e.g., after five years).
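As a sketch, using TimescaleDB’s standard retention policy API (the `events` hypertable and the five-year interval here are just illustrative):

```sql
-- Automatically delete chunks of the events hypertable
-- once they are older than five years.
SELECT add_retention_policy('events', INTERVAL '5 years');
```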
Also, if you want to move a particular chunk back from the low-cost storage tier to the high-performance tier (for example, because you need to query it frequently again), you can do so with a single call:

```sql
-- Untier a particular chunk
CALL untier_chunk('_hyper_1_1_chunk');
```
You can track how much data has been tiered (and how much it will cost at the end of the month) in the Timescale console:
And speaking of billing estimations…
Our high-performance storage tier has an effective price of $0.177 per GB/month of data.
When tiering data, you will only pay for the data you store, not for the queries executed or the amount of data scanned: this is truly a flat price. Our goal with this pricing structure was to provide a transparent, unambiguous, and simple way to calculate the total data storage cost, making it easier to manage your data.
As a quick example, let’s say you have a hypertable with 5.2 TB of storage, with recent hypertable chunks and other Postgres tables taking up approximately 200 GB and about 5 TB of hypertable data older than a month. You do not access or query this older data frequently, meaning you don’t need it for the daily operations of your application. Still, you’d like to keep it accessible in your database for ad hoc queries or compliance requirements (we see many such cases among our customers).
As a cost-effective data management strategy, you can tier all chunks older than a month to the low-cost tier and decrease the cost of storing that 5 TB of data from around $478/month to around $105/month, a 78% decrease in your storage bill. (For this estimation, we assume you enabled compression for your hypertable and consider the median overall compression in all Timescale services).
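As a sanity check, the savings math works out like this (the $478 figure is the quoted monthly cost of keeping this data in high-performance storage, which already factors in the median compression ratio):

```python
LOW_COST_PRICE = 0.021  # $ per GB/month in the low-cost tier
TIERED_GB = 5_000       # 5 TB of data older than a month
HIGH_PERF_COST = 478    # quoted $/month in high-performance storage

low_cost_bill = TIERED_GB * LOW_COST_PRICE  # ~$105/month
savings_pct = (HIGH_PERF_COST - low_cost_bill) / HIGH_PERF_COST

print(f"${low_cost_bill:.0f}/month, {savings_pct:.0%} saved")
```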
The savings will increase alongside your data: when moving multiple terabytes to this low-cost tier, your storage bill will decrease from thousands of dollars to a few hundred. The following reference table illustrates how affordable our low-cost storage tier actually is.
You get the point!
There is one more thing that makes Tiered Storage even more amazing: when you keep data in the low-cost storage tier, you pay for that data only once, regardless of whether you have a high-availability replica or read replicas running in your service.
The same applies to forks. In Timescale, you can create copies of your primary database (we call them forks) by clicking one button from the UI, for example, for running tests or creating dev environments. When creating one (or more) forks, you won’t be billed for data shared with the primary in the low-cost storage. If you decide to tier more data that’s not in the primary, you will pay to store it in the low-cost tier, but you will still benefit from substantial savings by moving it from the high-performance tier of the fork to the cheaper tier.
To make this crystal clear, let’s roll out an example. Imagine that you have a primary service with 6.5 TB of data and that you also have set:
A high-availability replica to significantly reduce the risk of downtime and data loss due to failures
A read replica to serve your read queries and allow the primary to dedicate itself fully to writes
A fork of that service for development and testing purposes
From a billing perspective, if you kept the 6.5 TB of data in your primary service in the high-performance storage tier, your storage bill would reflect [6.5 TB x 4] to account for the two replicas, the fork, and the primary service: 26 TB in total. Assuming our median compression rate, this would be costly: around $4,602/month.
But what happens if your application only needs to actively access the most recent 500 GB of data? You could tier 6 TB to low-cost storage and keep only 500 GB in your high-performance storage tier. And since you only pay once for the data in your low-cost storage tier, this is how your new storage bill would look:
[500 GB x 4] = 2 TB in high-performance storage (instead of 26 TB)
[6 TB x 1] in the low-cost storage tier
The above storage bill would come to approximately $480/month: you would save over $4,000/month! This is the multiplier effect Tiered Storage has on savings. Especially if you’re running replicas or forks, it is a great idea to take advantage of the low-cost storage tier: the savings in your overall bill will be very significant.
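Using the prices quoted earlier ($0.177/GB-month effective for high-performance storage, $0.021/GB-month for low-cost storage), a rough sketch of the two bills:

```python
HIGH_PRICE = 0.177  # $ per GB/month, high-performance (effective)
LOW_PRICE = 0.021   # $ per GB/month, low-cost object storage
COPIES = 4          # primary + HA replica + read replica + fork

# Everything in high-performance storage: every copy is billed.
all_high = 6_500 * COPIES * HIGH_PRICE  # ~ $4,602/month

# Tiered: 500 GB of hot data per copy, 6 TB billed only once.
tiered = 500 * COPIES * HIGH_PRICE + 6_000 * LOW_PRICE  # ~ $480/month

print(round(all_high), round(tiered))
```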
We released an early version of our Tiered Storage functionality in the Timescale platform as Early Access in March. After many improvements, it has now reached General Availability. Since its first release, we have worked hard to make Tiered Storage stable, reliable, and faster.
We also thought long and hard about our pricing model, iterating consistently over multiple ways to price our low-cost storage tier. Finally, we leaned towards the simplest: a flat rate on the data volume pre-compression. We’ve already mentioned how much we value simple and predictable pricing, but it was also important for us to offer a price per GB/month as low as possible. Our price landed at $0.021, lower than the standard Amazon S3 storage rate.
Personally, I (Yannis) must admit that after leading teams building cloud-native solutions for more than a decade, I still occasionally have to recheck multiple rate tables on various cloud pricing pages, hunting for extra fees, whenever I want to accurately compute the total cost of our services.
At Timescale, we believe that you should not have to build a complicated spreadsheet to be able to run a database service or make informed choices about storage layers.
Our commitment to making developers’ lives easier led us to the flat rate of $0.021 per GB/month—no guesswork, hidden costs, or charges per query or data reads.
Data volume pre-compression means the data volume before applying Timescale compression. For example, due to Timescale compression, a 500 GB volume in the high-performance storage tier might end up needing only 50 GB of disk space once compressed. If you decided to tier this data to low-cost storage, your bill would be calculated over the original 500 GB volume, as in [500 GB * $0.021] per month.
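In other words, the bill is simply the pre-compression volume times the flat rate:

```python
PRE_COMPRESSION_GB = 500  # logical data volume before compression
FLAT_RATE = 0.021         # $ per GB/month in the low-cost tier

# The on-disk compressed size (e.g., 50 GB) never enters the formula.
monthly_bill = PRE_COMPRESSION_GB * FLAT_RATE

print(f"${monthly_bill:.2f}/month")  # $10.50/month
```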
All data inserted into Timescale is initially written to our high-performance storage layer. Backing your most recent data with fast disks delivers top insert and query performance, a usage pattern that fits the needs of data-intensive applications.
In contrast, the low-cost storage tier is an object store built on Amazon S3. This object store is much more than an external bucket to archive your data, though: it’s an integral part of your database. When you move data to this object storage tier, your database remains fully aware of all the semantics and metadata, and you can keep querying as usual with standard SQL (albeit with slower performance).
Behind the scenes, we’re storing the data in a compressed columnar format (specifically, Apache Parquet). When data is tiered, chunks stored in Timescale’s native internal database format are converted into Parquet files and written to the object store.
When you run your SQL query, it will transparently pull data from the high-performance storage tier, the object storage tier, or both, as required. When we say transparent, we mean transparent: Timescale supports arbitrarily complex queries across its standard and tiered data, including complex predicates, JOINs, CTEs, windowing, hyperfunctions, and more.
In the example query below (with an `EXPLAIN` clause), you can see how the query plan includes a `Foreign Scan` when the database accesses data from the object storage tier. In this example, `devices` and `sites` are standard Postgres tables (residing only in high-performance storage), while `metrics` is a hypertable that stretches across both storage tiers. When executing this query against the `metrics` hypertable, three of its chunks were read from standard storage, and five chunks were read from object storage.
```sql
EXPLAIN
SELECT time_bucket('1 day', ts) AS day,
       max(value) AS max_reading,
       device_id
FROM metrics
JOIN devices ON metrics.device_id = devices.id
JOIN sites ON devices.site_id = sites.id
WHERE sites.name = 'DC-1b'
GROUP BY day, device_id
ORDER BY day;
```
```
                        QUERY PLAN
----------------------------------------------------------
 GroupAggregate
   Group Key: (time_bucket('1 day'::interval, _hyper_5666_706386_chunk.ts)), _hyper_5666_706386_chunk.device_id
   ->  Sort
         Sort Key: (time_bucket('1 day'::interval, _hyper_5666_706386_chunk.ts)), _hyper_5666_706386_chunk.device_id
         ->  Hash Join
               Hash Cond: (_hyper_5666_706386_chunk.device_id = devices.id)
               ->  Append
                     ->  Seq Scan on _hyper_5666_706386_chunk
                     ->  Seq Scan on _hyper_5666_706387_chunk
                     ->  Seq Scan on _hyper_5666_706388_chunk
                     ->  Foreign Scan on osm_chunk_3334
               ->  Hash
                     ->  Hash Join
                           Hash Cond: (devices.site_id = sites.id)
                           ->  Seq Scan on devices
                           ->  Hash
                                 ->  Seq Scan on sites
                                       Filter: (name = 'DC-1b'::text)
```
In the above query plan, the `Foreign Scan on osm_chunk_3334` corresponds to fetching data from the bottomless object storage tier.
To avoid processing chunks falling outside the query’s time window, we perform chunk exclusion to only touch the chunks that are needed to satisfy the query. The database stores various forms of metadata to build a “map” of row groups and columnar offsets within object storage.
Furthermore, when a query is run, it is further selective about the data it reads. If your query only touches a range of rows and a few columns, only that subset of the data is read from the S3 object behind the low-cost storage tier.
In the above, for example, only the `device_id` and `value` columns are read from the object storage tier. If an additional time-based `WHERE` filter had been included, the database would first use its metadata (stored in high-performance storage and cached in memory) to further narrow down which Parquet files and row groups need to be read to execute the query. All of this reduces query latency, even when you access this bottomless storage transparently through PostgreSQL.
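For illustration, a query like the following sketch (reusing the `metrics` hypertable and columns from the example above) gives the planner a time predicate to work with:

```sql
-- The time filter lets the planner exclude chunks (and, for tiered
-- data, Parquet files and row groups) outside the last seven days.
SELECT device_id, max(value) AS max_reading
FROM metrics
WHERE ts > now() - INTERVAL '7 days'
GROUP BY device_id;
```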
Storage decisions around historical data don’t have to be costly. In Timescale, you now have access to a low-cost, infinite storage tier with no pricing gotchas that allows you to scale your database storage for an affordable price without compromising the performance of your application.
If you’re wondering whether this is a good solution for your use case, give Tiered Storage a try in the Timescale platform.
- Written by Yannis Roussos, Carlota Soto, and Ana Tavares.