Once, I needed to implement metrics in my team's services. At first, I didn't understand what I wanted to get out of them: it's one thing to plug in a library and draw graphs, and quite another to show meaningful data.
I needed a guide that combined two things: “why it's done this way” and then “how to do it right”. In the end, I had to write that guide myself.
Its goal is to explain to developers of any background what metrics are, how to think about them, and how to use them meaningfully. We will work with Prometheus and Grafana; if you have a different tech stack, that's okay. We will also touch on fundamental topics such as percentiles, derivatives, and cardinality.
My guide will be long:
Before plugging in libraries, bringing up servers, and writing queries, let's start with the introduction that I lacked when I was studying metrics.
You want to collect different metrics "to see everything happening in the services".
Now ask yourself and your team important questions:
You study common approaches and pick a ready-made tech stack to avoid reinventing the wheel. But every stack has its own quirks, and no colleague or guide explains why it is organized so strangely.
Technically, the task looks something like this: the company already has, for example, Grafana (and if not, it's not hard to set it up). There are a hundred ways to collect and transfer data, but a few questions arise at once:
We will deal with all this.
For complex technologies, it is good to have not only a manual but also the right mindset. Then, all the tools will be easier to use.
In general, metrics are unexpectedly complicated. We apply various engineering and mathematical tricks to compress an inconveniently large data set into something visual. Logs, by contrast, are simply written "as is" and are probably already used in your services. So why not parse the logs and build graphs from them? You can, and it works fine - up to a certain limit.
Logs are about accuracy: we collect information on every event. If you aggregate them, you can see the big picture and even pull out details down to a single request. But sending logs for every "sneeze" doesn't scale well.
Log traffic depends on application traffic: more requests and more actions mean proportionally more logs. Logs are often unstructured and difficult to index and parse - you have to pull numbers out of text to build graphs. And traditionally a lot of unnecessary stuff gets written into logs, so processing and aggregation suffer and lag behind real time.
Metrics are a general picture of the application state. We do not collect all the details, only a ready-made digest: for example, the number of requests to the service. We get the big picture but cannot drill down into details. In return, this approach scales well and works quickly: we can react instantly and send out alerts. If we need details, we think and add them a little at a time.
For example, we count the number of requests to different endpoints. Metrics traffic depends on the number of metrics and on how often they are collected and sent to the storage. Finally, we do not write anything unnecessary into metrics; on the contrary, we try to write as little as possible.
The trade-off: write everything and figure it out later (at the cost of traffic and processing complexity) vs. think first and write only what is necessary (at the cost of losing details).
If you don't have many services and don't plan to scale, you can keep it simple and stop with log parsing. The popular ELK stack has built-in visualization tools.
Metrics collection can be organized in two ways. Each has its pros and cons, and the deeper you dig and the larger the scale, the harder the choice becomes.
Push. The principle is the same as with logs or a database: we write data to the storage when an event occurs. Events can be sent in batches. The method is easy to implement from the client's point of view.
Pull. Here, everything is the other way around. Applications keep compact data in memory, like `processed requests: 25`, and someone periodically comes and collects it, for example, via HTTP. From the client's point of view, pull is harder to implement but easier to debug. Each application has an endpoint where you can go with a browser and see what is happening in that particular instance. There is no need to write tricky queries, filter data by replicas, or even have access to a common metrics storage. In addition, this model encourages developers to write only what is necessary into metrics, not everything.
To keep metrics from eating up extra memory in the application, and to render them quickly, we have to aggregate them. No one wants to store details about every request in memory, because this data would soon occupy all of it. We limit ourselves to the minimum: for example, counters of requests and errors. This ultimately reduces the load on the infrastructure, which only has to pick up and store data that the application developers have already compressed as much as possible.
Of course, the same preparation can be done with push, but it is much easier to just send everything indiscriminately and sort it out later.
I chose the pull way. Prometheus was built mainly around this model, and it was easier for me to choose a solution with centralized management.
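To make the pull model concrete, here is a minimal sketch using the Python prometheus_client library (the port, metric name, and fake workload are my own illustrative choices): the application keeps an aggregated counter in memory and exposes it over HTTP, and the scraper picks it up whenever it comes by.

```python
# A minimal pull-model sketch with the Python prometheus_client library.
# The port, metric name, and simulated workload are illustrative.
import random
import time

from prometheus_client import Counter, start_http_server

PROCESSED = Counter("processed_requests_total", "Requests processed by this worker")

if __name__ == "__main__":
    # Expose all registered metrics at http://localhost:8000/metrics for the scraper.
    start_http_server(8000)

    while True:
        # Simulate doing some work, then update the in-memory counter.
        time.sleep(random.uniform(0.1, 0.5))
        PROCESSED.inc()
```

Opening the /metrics page in a browser shows the current counter value: exactly the "processed requests: 25" kind of snapshot described above.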
Metrics need to be stored somewhere and then queried. Specialized databases solve this task: time series databases (TSDB).
The defining feature of a TSDB is processing time series, i.e., uniform measurements over time. Databases of this type optimize the storage of a number recorded at regular intervals. It is easier to understand by example: to collect daily air temperature, we need to store something like `[(day1, t1), (day2, t2), ...]` and nothing else.
Important: a TSDB stores a time and one number bound to that time. Then time and a number again, and so on.
Specifics:
Since all this does not look like SQL, there is no need to support a complex query language. You can make your own simple, metrics-oriented language.
We used InfluxDB for metrics since it has SQL-like queries. But at our volumes, it just exploded, and high availability was bad, so we gave up on it.
There are specialized data formats for working with a TSDB that are simpler than SQL. In our case, Prometheus' TSDB is organized in such a way that the metrics format and the query language are almost the same!
Once we have a database to store the metrics and a way to deliver data into it, we can write queries, but... it's no fun just looking at tables of numbers. That's why Grafana usually makes the queries to the database: it parses the tables from the answers and draws convenient graphs. You can customize the scale, colors, and other decorations, but you will still have to write the queries yourself.
From now on, we will deal with Prometheus queries without the nuances that Grafana adds. We won't need Grafana at all to start with: Prometheus has a simple web interface that visualizes query results, and it is enough for learning, testing, and debugging.
The most useful thing that can be squeezed out of metrics and graphs is alerts. We don't need a live person who constantly monitors free disk space or the number of HTTP 500 responses. It is possible to set up automation that reacts to graphs exceeding the acceptable limits and sends out notifications. But to get to it, you will have to first learn how to collect and store metrics, then request them and display them on charts, and only then set up alerts.
I will not discuss alerts, because conceptually this topic is not as difficult as metrics, and the architecture of the solution depends heavily on your specifics: some people need notifications in a messenger, others a Telegram bot, and so on.
I use Prometheus to work with metrics. It includes:
First, we will understand how the server works and stores data, then look at the format of exporting metrics from applications, and then learn how to write queries. Prometheus is so organized that the metrics format and the query language are very similar, so you won't have to suffer too much.
We as users need to know that the Prometheus server periodically sends a `GET /metrics` request to our application. This is called a scrape. The request interval can be anything; you can also customize the endpoint and replace HTTP with something else. In the examples below, we will assume that a scrape happens once every 30 seconds.
The idea is that Prometheus collects a slice in time, and then we use its tools to calculate changes.
That is, we can count 10 HTTP requests in 30 seconds. The scraper came, and we gave it that data. What happens next:
Correct | Incorrect
---|---
We continue incrementing the same counter | Resetting the counter
Yes, technically you can reset the counter, but you don't want to do that! The application should just keep accumulating an ever-growing (monotonically increasing) counter forever. This property will come in handy later.
Roughly speaking, Prometheus will take a derivative to find out "how many requests came in the last minute". By the way, derivatives are not a big deal. A detailed and clear explanation will be given later.
By default, Prometheus scrapes metrics without any authentication. So the question arises: how do we restrict access so that outsiders cannot see anything? This is especially important if the application is reachable from the outside, from the Internet. There are different options; for example, you can serve `/metrics` on a separate port that is not exposed to the outside, as sketched below.
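With the Python client, for example, this can be a matter of binding the metrics listener to a separate port on an internal-only address (a sketch; the address and port are illustrative and depend on your network layout):

```python
# Sketch: serve /metrics on a separate listener that is not exposed to the outside.
# 127.0.0.1 and port 9100 are illustrative; in practice this might be an internal
# interface that only the Prometheus server can reach.
from prometheus_client import start_http_server

start_http_server(9100, addr="127.0.0.1")
```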
Great, let's move on.
Code doesn't always live as an application with an HTTP server. Sometimes it is a service without HTTP, some RabbitMQ queue handler, or simply a cronjob that starts on a timer, works, and dies.
The easy way to get metrics in these cases is to decide that the overhead of adding an HTTP server just for the sake of `/metrics` doesn't scare you. That's fine, but it won't help cronjobs, which don't live as a persistent process and can't store and serve metrics every 30 seconds. That's why there are ways to arrange metrics collection that bypass the pull model. You'll have to bring up an auxiliary service of your choice.
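One common choice is the Prometheus Pushgateway: a short-lived job pushes its final metric values to the gateway before exiting, and Prometheus then scrapes the gateway in the usual pull fashion. A sketch with the Python client (the gateway address, job name, and work function are placeholders):

```python
# Sketch: a cronjob pushes its metrics to a Pushgateway before it dies.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "cronjob_last_success_timestamp_seconds",
    "Unix time of the last successful run",
    registry=registry,
)

def do_the_actual_work() -> None:
    pass  # placeholder for the real cronjob logic

if __name__ == "__main__":
    do_the_actual_work()
    last_success.set_to_current_time()
    # The gateway address and job name are placeholders for your own setup.
    push_to_gateway("pushgateway.internal:9091", job="nightly-cleanup", registry=registry)
```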
On the side of our code, we usually need to plug in and configure a library for collecting metrics. It will aggregate, format, and serve the page with metrics. In some tech stacks, the libraries integrate with your web server on their own; in others, you will have to do some wiring yourself. The basic idea is that metrics libraries provide an API through which you register and describe a metric and then update it from anywhere in the application.
For example, increase the metric "number of requests to this endpoint" when receiving an HTTP request. When sending a response, increase the metric "request processing time". Now, it's time to understand what metrics are from the Prometheus point of view and how they are updated from the code. This way, we will understand which methods to use and which metrics are better suited for certain tasks.
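As a first taste of how that looks in code, here is a sketch with the Python prometheus_client library (the handler, metric names, and business logic are illustrative):

```python
# Sketch: register metrics once, update them from the request-handling code.
import time

from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "http_requests_total", "Requests made to public API", ["method", "url"]
)
REQUEST_TIME = Histogram(
    "http_request_duration_seconds", "Time spent processing a request"
)

def process(request):
    return "ok"  # placeholder for the real business logic

def handle_message_post(request):
    # "Number of requests to this endpoint" goes up as soon as the request arrives.
    REQUESTS.labels(method="POST", url="/messages").inc()
    started = time.monotonic()
    response = process(request)
    # "Request processing time" is recorded when we send the response.
    REQUEST_TIME.observe(time.monotonic() - started)
    return response
```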
The format in which the application writes metrics and serves them to Prometheus is simple enough and designed to be easily readable by eye. You don't need to count and format metrics manually - there are libraries for that. This is how the page that the application serves at `GET /metrics` looks:
```
# HELP http_requests_total Requests made to public API
# TYPE http_requests_total counter
http_requests_total{method="POST", url="/messages"} 1
http_requests_total{method="GET", url="/messages"} 3
http_requests_total{method="POST", url="/login"} 2
```
What's here:

- `HELP` - description to help people
- `TYPE` - type of metric
- `http_requests_total` - name of the metric
Storage works like this: the name of a metric is actually a label with the name `__name__`. All labels together describe a time series, i.e., it is like a table name made up of all the key-value pairs. In this series lie the values `[(timestamp1, double1), (timestamp2, double2), ...]`. In the example above, we have one metric but three tables in the database: for `GET /messages`, `POST /messages`, and `POST /login`. Into each table, every 30 seconds, goes the next number that the application showed at the moment of the scrape.
The doubles are stored in time. No ints. No strings. No additional information. Just numbers!
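As a toy illustration of that storage model in Python terms (this is not how Prometheus is implemented internally, just a mental picture): each unique label set is a key, and under it lies a list of (timestamp, value) samples.

```python
# Toy mental model of the storage, not Prometheus internals:
# one time series per unique label set, each holding (timestamp, float) samples.
storage: dict[frozenset[tuple[str, str]], list[tuple[int, float]]] = {
    frozenset({("__name__", "http_requests_total"),
               ("method", "GET"), ("url", "/messages")}): [
        (1700000000, 3.0),
        (1700000030, 5.0),  # the next scrape, 30 seconds later
    ],
    frozenset({("__name__", "http_requests_total"),
               ("method", "POST"), ("url", "/messages")}): [
        (1700000000, 1.0),
        (1700000030, 1.0),
    ],
}
```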
By the way, it is useful to look up naming practices in the documentation. Labels are used for searching and aggregation, but there is one peculiarity: the server will have a bad time if you use unique or rarely repeated label values. This is because each new label value is a new time series, that is, a new table. Therefore, you should not abuse them. A good label is limited in its possible values: it is bad to write the whole `User-Agent` into a label, but the browser name and major version are okay. A user name is doubtful if there are hundreds of them, but fine if there are tens (API clients of an internal service, for example).
For example, we write metrics about HTTP requests. Multiply all possible values of all labels: 2 HTTP verbs, 7 URLs, 5 service replicas, 3 response types (2xx, 3xx, 4xx), 4 browsers. That is 840 time series! It's like 840 tables in SQL. Prometheus can handle tens of millions of rows, but a combinatorial explosion is easy to arrange. You can read more here: Cardinality is key.
In general, don't hesitate to write what you really need, but don't overdo it. Keep an eye on Prometheus' resource consumption and make sure that labels don't contain arbitrary text.
Before writing a metric, think about what form it will be displayed in. You don't need a graph with dozens of colorful lines dancing on it, so it is useless to write down the exact user-agent: you would still want to group it into something meaningful. On the other hand, the same metric can be grouped by different labels and drawn on different graphs. If you, for example, count HTTP requests and store the method, client ID, and response code in labels, this one metric can already be displayed in different ways: HTTP requests by client, or HTTP requests by method and response code.
Even though metrics have a `TYPE` field, there is no difference “under the hood”. Like `HELP`, it exists just to make it easier for people to work with metrics. However, the libraries we write metrics with are built around these types, and some query functions only work correctly for certain types. So you can think of a type as a convention for how the value of that metric behaves.
Further in the text, “API” stands for averaged method names from Prometheus libraries for different languages, just for illustration. There are, of course, variations: for example, the App Metrics library for dotnet has slightly different names and methods, but the essence is the same.
The counter is a monotonically increasing number. It never decreases! It can be reset to zero, for example, during restarts of the service that writes metrics. This is important because Prometheus has special functions that take this into account. API: `increase()`, `add(x)`.
How do we know how many requests there were per unit of time when we only have one number? We look at the delta, since Prometheus saves snapshots of this number every 30 seconds. If the application restarted and the counter suddenly dropped to zero, a small workaround is needed, and it is already built into the functions that work with counters.
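Roughly what those counter-aware functions do, sketched in plain Python (a simplification, not the exact Prometheus algorithm): take the delta between two scraped snapshots, and if the value dropped, assume the counter was reset and count from zero.

```python
# Simplified sketch of rate-from-counter logic; the real algorithm is more careful.
def per_second_rate(prev_value: float, curr_value: float, interval_seconds: float) -> float:
    if curr_value >= prev_value:
        delta = curr_value - prev_value
    else:
        # The counter went down: the application restarted and the counter was
        # reset to zero, so everything we see now happened after the restart.
        delta = curr_value
    return delta / interval_seconds

print(per_second_rate(100, 110, 30))  # 10 requests in 30 s => ~0.33 req/s
print(per_second_rate(110, 4, 30))    # counter reset after a restart; the 4 new requests still count
```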
Gauge is a number that can walk up and down. API: `setValue(x)`, `increase()`, `decrease()`.
Since it isn't monotonic, some math tricks won't work, meaning it's a little more limited in its use. What kind of tricks? I will discuss this later.
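A typical gauge, sketched with the Python client (metric names are illustrative): something that genuinely goes both up and down, like the number of requests currently in flight or the size of a local queue.

```python
# Sketch: gauges for values that go up and down.
from prometheus_client import Gauge

IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being processed")
QUEUE_SIZE = Gauge("work_queue_size", "Items waiting in the local queue")

IN_FLIGHT.inc()     # a request started
IN_FLIGHT.dec()     # a request finished
QUEUE_SIZE.set(42)  # or simply overwrite with the latest measurement
```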
Histogram is the aggregation of something by the application itself, used when we want to know the distribution of values into predefined groups (buckets). API: `observe(x)`.
For example, we want to know the duration of HTTP requests. Let's define which times to consider good, which to consider bad, and how much detail we want to know. I can describe a qualitative distribution:

- `<= 0.1` sec is a good request; I expect the majority of requests to be like this
- `<= 1` is ok, but it would be better to know that such requests occur
- `<= 5` is suspicious; if there are a lot of these, let's go and look at the code
- more than 5 is outright bad; for the sake of uniformity, I can say that it is `<= infinity`
How it works: once a request arrives, we measure the processing time `X` and update the histogram. We add `+1` to the corresponding buckets and add `+X` to the total time. Here are some examples of requests with different times hitting the buckets:
- `0.01` will hit all buckets: `<= 0.1`, `<= 1`, `<= 5`, `<= infinity`
- `0.3` will hit all buckets except the first one: `<= 1`, `<= 5`, `<= infinity`; it will not hit the first one because the time is greater than 0.1
- `4` will get into the buckets `<= 5` and `<= infinity`; it will not get into the first and second because the time is greater than 0.1 and 1
- `10` will fall only into the bucket `<= infinity`; it will not get into the other buckets because the time is greater than 0.1, 1, and 5
The histogram counts the number of hits in each group, i.e., it memorizes counters, not the values themselves! We are, after all, limited by the fact that a metric is just one number. Each bucket is like a separate metric.
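The same walkthrough as a sketch with the Python client: the bucket bounds come from the example above, and the `+Inf` bucket is added automatically.

```python
# Sketch: the histogram from the example above, fed with the four sample durations.
from prometheus_client import Histogram

REQUEST_TIME = Histogram(
    "http_request_duration_seconds",
    "Time spent processing a request",
    buckets=[0.1, 1, 5],  # the +Inf bucket is appended automatically
)

for duration in (0.01, 0.3, 4, 10):
    REQUEST_TIME.observe(duration)

# Resulting cumulative bucket counters:
#   le="0.1"  -> 1  (only 0.01)
#   le="1"    -> 2  (0.01 and 0.3)
#   le="5"    -> 3  (0.01, 0.3 and 4)
#   le="+Inf" -> 4  (all of them)
# plus _count = 4 and _sum = 14.31 (the total observed time).
```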
How do we use it? You can simply plot the desired bucket divided by the total count: that gives the ratio of this bucket to all requests, i.e., the share of "good" or "bad" requests in the total mass, depending on what we want to observe. But it is better not to do this by hand: a single function aggregates the buckets into quantiles. It is convenient and simple and is calculated on the Prometheus server, though with a loss of accuracy (the fewer buckets, the lower the accuracy). If you want to calculate quantiles yourself, or you do not know in advance what buckets you need, there is another type: Summary.
Summary - get ready, this is going to be complicated. At first glance, it looks like a histogram, but it is the result of histogram aggregation: it gives out quantiles right away, for when we cannot determine the bucket list in advance. API: `observe(x)`.
It's easiest to explain in practice. We usually don't know in advance what to consider a good time for a request and what to consider a bad one. So let's just throw the measured time into a Summary and then see where 95% of the requests fit in. And 50%, and 99% too. So, a request comes in, we measure the processing time `X` and put it into the Summary:
- the time `X` itself is stored in a set of values in the application's memory
You can't aggregate summaries head-on (strictly speaking, you can, with a loss of accuracy, if you think it through). They also hang in the application's memory, because you must remember a set of values over some period of time. Because of this, summaries count quantiles with loss: old data is gradually pushed out, so it has less impact on the currently computed value.
You can take different approaches, such as a sliding window (throwing out the oldest values) or throwing out random ones. It depends on what you want to see in the metric: statistics on all requests in general or only on recent ones.
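For completeness, here is what recording into a summary looks like with the Python client. A caveat worth hedging: client libraries differ in how much of this they implement; the Python client's Summary, for instance, only exports the running count and sum, while quantile-emitting summaries as described above exist in some other language clients.

```python
# Sketch: recording observations into a Summary.
# Note: the Python client's Summary only exports _count and _sum; quantile
# calculation on the client side is available in some other language clients.
import random
import time

from prometheus_client import Summary

REQUEST_TIME = Summary("http_request_duration_seconds", "Time spent processing a request")

def handle_request() -> None:
    started = time.monotonic()
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for the real work
    REQUEST_TIME.observe(time.monotonic() - started)
```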
I think this information will be enough for you to get started. We will continue discussing metrics in DevOps in the next article.