My Prometheus is Overwhelmed! Help!

Prometheus is an incredibly popular option for monitoring and time series data on kubernetes. Many developers simply install it and let it do its thing. So it can come as a shock when it starts getting overwhelmed. Don’t panic! We’ll help you understand some of the cases you might be encountering and what your options are. The ‘Setup and Forget Strategy’ The setup and forget strategy can take you a long way. But you might need to know a bit more if: You have a high-volume use case (which you may not even realise). You need the data to be there over long time periods (years or at least months). You rely heavily on consuming the data in your apps. Something breaks. Prometheus and most other time-series databases work very differently from SQL databases. Let’s understand this better. Arrgh! It’s Broken! Let’s say you’ve figured out how to expose a metrics endpoint in your app running on kubernetes. You’ve built a grafana dashboard to monitor your apps health or some other data and that looks pretty nice. Or maybe you’ve built your own UI that queries prometheus directly. All is good with the world… until it breaks. Queries Become Slow There can be various causes for slowness. First try to rule out the least interesting ones: If you are calling prometheus in your own code, are you closing and timing out your http connections to prometheus? Have you allocated ? Try increasing it and replicating the load as a test (preferably in an environment with similar underlying hardware). enough resources to prometheus When you’ve checked increasing CPU and RAM, also try disk. It could be that the disk is getting full and prometheus is having to clean up (more on this in the ‘Data Disappearing’ section below). The more interesting thing that could be going wrong is that the way the data volume is increasing is causing the queries to slow down. This can happen surprisingly quickly. There could be a change to the systems producing the data but it can also be that apparently small changes have big effects. The reason we often don’t anticipate how an increase in data will affect prometheus is because we forget to look at . Here’s an important but slightly complicated part of the : cardinality prometheus documentation “Labels enable Prometheus's dimensional data model: any given combination of labels for the same metric name identifies a particular dimensional instantiation of that metric (for example: all HTTP requests that used the method POST to the /api/tracks handler). The query language allows filtering and aggregation based on these dimensions. Changing any label value, including adding or removing a label, will create a new time series.” So the upshot is that every different value to every label increases the work prometheus is doing significantly. You have to consider not just every metric value at every scrape, but . For this reason, it’s dangerous to . It’s also dangerous to use labels like userid, which can have many distinct values. every combination of labels that is being applied autogenerate label names in your code We should think of prometheus as a monitoring system rather than a database. If there’s a lot of variation in the labels then it’s (more on this below). suggested to look at a database If you hit this problem, there are ways to check the cardinality of your time series. You can run a query. See the ‘Containing Your Cardinality’ for more on this and how to reduce cardinality. sum(scrape_series_added) by (job) presentation slides Prometheus Actually Crashes Prometheus crashing could be an effect of one of the problems discussed above. Maybe it is out of memory due to working too hard. Unless you’ve got better information (e.g. from the logs) I’d try looking into the above. There’s other that could happen for much the same reasons, like scrapes becoming slow or memory spikes. related failure behaviour Data Disappearing Let’s say you have some query you run to show how many times something has happened. It keeps going up every time the event happens and all looks good… then the value mysteriously goes down. How can that happen? Well prometheus does not normally keep data forever. Actually (which you can configure). Note that’s global, not filtered by any particular type of data. by default it has a retention period of 15 days There’s also an option to tell prometheus how much disk space it should use. So if the data starts taking up too much disk space, then it will start deleting data. This can be counter-intuitive if you approach prometheus like a traditional SQL database. It isn’t designed for long-term storage (although you can set it to retain for very long periods if you have the space). It’s designed for monitoring so it’s about transient data about what is happening over a constrained time window (you can think of it as “what’s going on now” data). Out of Disk Now we know that prometheus will delete data beyond the retention period by default. Newer versions of prometheus might soon also delete data if the allocated disk space is used up, though that’s not the default at the time of writing (you have to configure it). So if you’ve not configured this then it’s possible you can run out of disk. Queries Exceed Maximum Data Points If a single query return would too many data points, prometheus simply won’t fully execute it. Instead you’ll get a message back saying that the query , by default 11,000. exceeds the data points limit Typically you’d need to be running a query over a pretty long time window to hit this problem. I hit it when trying to run a query over a period of months. But it can depend on how much data you are collecting and how dense it is (including how short your scrape interval is). When I first hit this my first response was big overkill. I wrote an that read metrics from prometheus, summed up data over intervals to make the gaps bigger (thus reducing number of data points) and put this in as a new time series. (This is an usual use of an exporter, which normally is used for scraping data on behalf of another service rather than pulling from prom and putting back again.) exporter project What I was doing is called downsampling. Taking the data and restructuring it increase the gaps between data points so that there are fewer data points overall. The easiest way to do this is usually a (which is what I then realised). These are basically queries that you write into your prometheus config that create new time series from existing ones. recording rule Do I Really Have to Reduce My Data? Isn’t There An Easier Way? What we’ve said so far basically amounts to: Don’t have lots of dense data over a long period. Especially don’t try to query lots of dense data over a long period. If you must query a lot of data, look at restructuring your data to make it less dense. So at this point you’re probably wondering ‘do I really have to reduce my data?’ It really depends on your situation. For anyone wondering whether there’s some tool out there that can make this easier, let’s take a look at some tools that either complement or replace prometheus and how they compare. Extending Prometheus High Availability You might be thinking, “can’t I handle more data by running more instances of prometheus?” The answer is both yes and no. With HA Prometheus, each prometheus handles some of the data. The to do this is ‘functional sharding’. This means for each service being scraped, all of its data is handled by just one prometheus. Functional sharding not the only way to shard data, but is the simplest one. It means each prometheus has a clear, dedicated remit. recommended way Functional sharding is more for scaling numbers of services. If you’ve got a single service producing a lot of data (e.g. too much data for your queries) then functional sharding in itself isn’t going to help you. Prometheus data can be for high availability. When you split the data up then you need a way to put it back together for querying. This is . Basically certain prometheus instance/s collect data from the other ones, so that data is sufficiently consolidated that you’ll know which prometheus to query for it. sharded differently achieved with federation The suggests having some prometheus instance/s that just have aggregated global data. Aggregation would be a way to reduce cardinality and make it possible to run queries that would otherwise hit limits. But setting this up . So for reducing cardinality in it’s the recording rule that is doing the work, not the federation. However, this does not mean that recording rules are the only way. prometheus documentation on federation still requires recording rules Thanos Thanos is another open source project in the CNCF. It complements prometheus and can be used to better scale prometheus. Its key points: Adds layer to prometheus for scaling. Supports downsampling, long-term storage and aggregation of data from multiple prometheus instances. Has several components inc. prom sidecars, compactor and query module - is quite heavy-weight if all are used. Can be tricky to install and properly test due to number of components. Query module supports PromQL. For details on all these components, there’s a . The main idea is that thanos can help with long-term storage, query data point limits and federation. But it’s not a one-click solution. It has different components targeted at each concern. good overview article from AWS If your key concern is individual queries hitting limits (which was the main issue I was facing before) then the particular component that will be of interest is the . That component can automatically downsample data so that queries will be able to run over larger time horizons without hitting max data points limits. compactor Thanos is not the only tool that can work with prometheus to help it scale. There are a number of tools that can take prometheus data and store it for the long-term and there’s a . listing of them in the official prometheus docs There’s also alternatives to prometheus out there. Actually there are some tools that can be used either with prometheus or instead of prometheus. This can make the options confusing so let’s try to clarify a bit. The Time Series Databases Scene This is a selective look at some time series databases. It is not comprehensive. My aim is to cover a selection that gives a good picture of the range of options and how to understand their approach and purpose. One thing that confuses people about time series databases is that they’re not based around a standard like SQL or a single design philosophy like relational databases. Any database that works well for storing pairs of a timestamp and a value and associated uses for that data (e.g. monitoring). InfluxDB InfluxDB is designed as a time series database suitable for metrics. It can be an alternative to prometheus or it can be a backend for prometheus as long-term storage. If run on its own then it collects the data. If run with prometheus then prometheus collects the data and InfluxDB gets it from prometheus. Some key points on InfluxDB: Does downsampling and long-term storage. Younger and not as much presence as prometheus. Looks easier to run than Thanos. Needs its own query format, . does not support PromQL Able to . scrape like prometheus Elasticsearch Elasticsearch is of course a document-based database and search engine so this one could be a surprise. But elasticsearch can also be used for time series data. And elastic can be used as . long-term storage for prometheus Elastic can be . They have an example on that suggests it can do downsampling at query time. It can also ingest . used for time series natively, without prometheus CPU metrics kube-state-metrics The main challenge for prometheus users interested in elastic is that elastic is not so well established for these use cases so detailed examples can be tricky to find (at least at the time of writing but if anyone has some then feel free to contact me e.g. on ). twitter suggests both elastic and influx can be used for time series and that (e.g. nlp use-cases, EFK log collection). This makes sense - elastic is a document database that can also do time series, whereas influx is specifically for time series. Comparison with influx influx cannot be used for text TimescaleDB TimescaleDB is relational and based upon postgres. Some key points about TimescaleDB: Open source with a managed offering also available. Supports and long-term storage. downsampling TimescaleDB authors offer interesting . sample datasets Can be used as a long-term store for prometheus data. Can support with Promscale. Prometheus-sytle scraping Can support PromQL with . Promscale Others There are a lot more offerings in this space that we’ve not covered here such as Cortex, VictoriaMetrics, M3db, Graphite, Datadog and . The above selection is intended to give a flavour of the variety of the space and help readers explore for themselves. more You Are Not Alone If your prometheus gets overwhelmed, remember you are not alone. It is quite normal to hit limitations with prometheus. There’s a whole space of tools to address this. There’s even newly-emerging approaches that we’ve not touched on here (such as ). detection of how and when cardinality explosion happens There’s no single easy solution that works for all cases. You need to think about your situation and what matters most for you. My top tips to leave you with are: Really explore the prometheus UI and PromQL. All the recording rules and what is being scraped and so on is in the prometheus UI if you know how to find it. Use slack groups to ask what others did. This article has benefited greatly from conversations on the kubernetes and data on kubernetes (DOK) slack groups.