Too Long; Didn't Read
The company embarked on a significant dogfooding effort to build "Insights," a tool that lets users analyze query performance. They collected query statistics from all customer databases, totaling more than 1 trillion records about individual queries. With over 10 billion new records ingested daily, the dataset served by a single Timescale service exceeds 350 TB. At its core sits TimescaleDB, a PostgreSQL extension, which handles this massive volume of data.
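As a rough illustration of how TimescaleDB handles ingest at this scale, a hypertable automatically partitions an ordinary PostgreSQL table into time-based chunks. The sketch below uses a hypothetical `query_stats` table and columns, not the actual Insights schema:

```sql
-- Hypothetical per-query statistics table; not the actual Insights schema.
CREATE TABLE query_stats (
    ts          TIMESTAMPTZ NOT NULL,
    database_id BIGINT      NOT NULL,
    query_hash  BIGINT      NOT NULL,
    duration_ms DOUBLE PRECISION,
    rows_read   BIGINT
);

-- Convert it into a TimescaleDB hypertable, chunked by time.
SELECT create_hypertable('query_stats', 'ts');
```

Once converted, inserts and queries use plain SQL; TimescaleDB routes rows to the right chunk behind the scenes.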
The database observability tool, Insights, correlates various metrics so users can find underperforming queries. To scale PostgreSQL to this dataset, they used continuous aggregates, hypertables, and data tiering, and Timescale's columnar compression achieved compression rates of up to 20x. They also leveraged features like approximate schema mutability for flexibility.
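The scaling features named above map to concrete TimescaleDB statements. A sketch, again against a hypothetical `query_stats` hypertable (the column names and the 7-day/`query_hash` policy choices are illustrative assumptions, not the settings Insights actually uses):

```sql
-- Continuous aggregate: an incrementally maintained hourly rollup.
CREATE MATERIALIZED VIEW query_stats_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) AS bucket,
       query_hash,
       avg(duration_ms) AS avg_duration_ms,
       count(*)         AS executions
FROM query_stats
GROUP BY bucket, query_hash;

-- Columnar compression: group compressed rows by query_hash and
-- compress chunks once they are older than 7 days.
ALTER TABLE query_stats SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'query_hash'
);
SELECT add_compression_policy('query_stats', INTERVAL '7 days');
```

Queries against `query_stats_hourly` then read the precomputed rollup instead of scanning raw rows, which is what makes dashboards over a 350 TB dataset practical.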
While the effort was successful, they identified areas for improvement, including better database observability and better handling of large data updates. They highlighted the challenges of altering schemas, adding new indexes, and updating continuous aggregates at scale. Nonetheless, the experience gave them invaluable insight into, and empathy for, customers running Timescale at scale.
Overall, the article details the company's journey scaling PostgreSQL with TimescaleDB to build "Insights," delivering real-time database observability for customers while confronting technical challenges and identifying opportunities for future improvement.