Exploring DevOps Metrics in Human Terms - Part 1

by Anna Sher, December 15th, 2023

Too Long; Didn't Read

Metrics can be tricky, but fear not! This guide demystifies the world of metrics, Prometheus, and Grafana. Learn the importance of metrics, their types, and how to implement them effectively. From understanding the architecture of metrics collection to crafting insightful queries, this guide has you covered. Whether you're dealing with counters, gauges, histograms, or summaries, Prometheus and Grafana serve as your dynamic duo for data-driven insights. Dive into the world of metrics and emerge with the knowledge to unlock meaningful data visualization.



Once, I needed to implement metrics in my team's services. At the start, I didn't understand what I wanted to get out of them. It's one thing to use a library and draw graphs, and quite another to show meaningful data.


I needed a guide combining these two things: “Why it's done this way” and then “How to do it right”. As a result, I had to write such a guide myself.


Its goal is to explain to developers with any background what metrics are and how to think about them and use them meaningfully. We will deal with Prometheus and Grafana. If you have a different tech stack, it's okay. We will also touch upon fundamental topics: for example, percentiles, derivatives, and cardinality.


My guide will be long:

  1. First, we will look at the architecture: how metrics are collected and where they are stored.
  2. Next, we'll deal with the types of metrics - they're not as simple as they seem.
  3. Then, we'll take a short detour into math (but only from an engineering point of view!).
  4. And finally, we'll learn how to write queries, but not just for the sake of it. We'll look at different pitfalls and non-obvious points at once.


Before plugging in libraries, bringing up servers, and writing queries, let's set things right and start with the introduction that I lacked when studying metrics.


Common Sense

You want to collect different metrics "to see everything happening in the services".

Now ask yourself and your team important questions:


  • How do you do it?
  • What exactly do you want to see as a result?
  • How do other people solve this problem?


Study common approaches and pick a ready-made tech stack to avoid reinventing the wheel. But every stack has its own quirks, and no colleague or guide explains why it is organized so strangely.


Technically, the task looks something like this: the company already has, for example, Grafana (if not, it's not hard to set up). There are a hundred ways to collect and transfer data, but the following questions arise at once:


  • What do we want to display there?
  • Why can't we just put it into SQL?
  • Why do we need Prometheus?


We will deal with all this.


For complex technologies, it is good to have not only a manual but also the right mindset. Then, all the tools will be easier to use.


In general, metrics are unexpectedly complicated. We use various engineering and mathematical tricks to compress an inconvenient, large data set into something visual - unlike logs, which are simply written "as is" and are probably already used in your services. By the way, why not parse logs and build graphs from them? You can, and it's fine - up to a certain limit.


Metrics vs. Logs


Logs are about accuracy: we collect information on every event. If you aggregate them, you can see the big picture and even pull out details down to a single request. But sending logs for every "sneeze" doesn't scale well.


Log traffic depends on application traffic: more requests and actions mean proportionally more logs. Logs are often unstructured and difficult to index and parse - you have to pull numbers out of text to build graphs. And traditionally, a lot of unnecessary stuff gets written into logs, so processing and aggregation suffer and lag behind real time.


Metrics are a general picture of the application state. We do not collect all the details, only a ready-made extract: for example, the number of requests to the service. We get the big picture but cannot drill into the details. In exchange, this approach scales well and works quickly: we can react instantly and send out alerts. If we need more detail, we think and add it a little at a time.


For example, we count the number of requests to different endpoints. Metrics traffic depends only on the number of metrics and on how often they are collected and sent to storage. Finally, we do not write anything unnecessary into metrics; on the contrary, we try to write as little as possible.


The trade-off: writing everything and thinking later (at the cost of traffic and processing complexity) vs. thinking first and writing only what is necessary (at the cost of losing details).


If you don't have many services and don't plan to scale, you can keep it simple and stop with log parsing. The popular ELK stack has built-in visualization tools.


Push vs Pull

Metrics collection can be organized in two ways. Each has its pros and cons, and the deeper you dig and the larger the scale, the harder the choice becomes.


Push. The principle is the same as with logs or a database: we write data to the storage when an event occurs. Events can be sent in batches. The method is easy to implement from the client's point of view.


Pull. Here, everything is the other way around. Applications keep compact data in memory, like "processed requests: 25". Someone periodically comes and collects it, for example, via HTTP. From the client's point of view, pull is harder to implement but easier to debug: each application has an endpoint you can open in a browser to see what is happening in that particular instance. There is no need to write tricky queries, filter data by replicas, or even have access to a common metrics store. In addition, this model encourages developers to write into metrics only what is necessary, not everything.


To keep metrics from taking up extra memory in the application, and to render them quickly, we have to aggregate them. No one wants to store details about each request in memory - that data would soon occupy all of it. We limit ourselves to the minimum: for example, counters of requests and errors. This ultimately reduces the load on the infrastructure, too: it only has to pick up and store data that is already maximally compressed, already thought through by the application developers.


Of course, data for push can be prepared just as carefully, but it is much easier to send everything indiscriminately and sort it out later.


I chose the pull model. Prometheus is built mainly around it, and it was easier for me to choose a solution with centralized management.


TSDB (Time Series Database)


Metrics need to be stored somewhere and then queried. Specialized databases solve this task: time series databases (TSDB).


The defining feature of a TSDB is the processing of time series, i.e., uniform measurements over time. Databases of this type optimize the storage of a number recorded at regular intervals. It is easier to understand by example: to collect daily air temperature, we need to store something like [(day1, t1), (day2, t2), ...] and nothing else.


Important: TSDBs store a timestamp and one number bound to that timestamp; then another timestamp and number, and so on.


Specifics:

  • Relational capabilities are minimal: SQL support, if any, is limited
  • The set of supported data types is cut down
  • Optimized for constant, continuous writes


Since all this does not look like SQL, supporting a complex query language is unnecessary. You can make your own metrics-oriented, simple language.


We used InfluxDB for metrics since it has SQL-like queries. But at our volumes, it just exploded, and its high availability was poor, so we gave up on it.


There are specialized data formats for working with a TSDB, which are simpler than SQL. In our case, Prometheus's TSDB is organized in such a way that the metrics format and the query language are almost the same!


Visualization

Once we have a database to store the metrics and a way to deliver data to it, we can write queries, but... it's no fun just looking at tables of numbers. That's why queries to the database are usually made by Grafana: it parses the tables in the responses and draws convenient graphs. You can customize scales, colors, and other decorations. But you will still have to write the queries yourself.


From now on, we will deal with Prometheus queries without the nuances that Grafana adds. That's why we won't need it at all to start with - Prometheus has a simple web interface that visualizes query results. It will suffice for training, testing, and debugging.


Alerts

The most useful thing that can be squeezed out of metrics and graphs is alerts. We don't need a live person constantly monitoring free disk space or the number of HTTP 500 responses. We can set up automation that reacts when values exceed acceptable limits and sends out notifications. But to get there, you first have to learn how to collect and store metrics, then how to query them and display them on charts, and only then set up alerts.


I will not discuss alerts, because conceptually this topic is not as difficult as metrics. Moreover, the architecture of the solution depends heavily on your specifics: some need notifications in a messenger, others a Telegram bot, and so on.


Prometheus: Server and Clients


I use Prometheus to work with metrics. It includes:


  • The server - collects and stores metrics
  • The data format
  • The query language, called PromQL


First, we will understand how the server works and stores data, then look at the format of exporting metrics from applications, and then learn how to write queries. Prometheus is so organized that the metrics format and the query language are very similar, so you won't have to suffer too much.


Scrape and Counting Metrics

As users, we need to know that:


  • The task of the application is to expose an HTTP page with its metrics in a specific format.
  • The server periodically makes an HTTP GET /metrics request to our application. This is called a scrape. The request interval is configurable; you can also customize the endpoint and replace HTTP with something else. In the examples below, we will assume that a scrape happens once every 30 seconds.
  • The server saves the application's response to the database with the current timestamp.
  • The application's metrics remain in its memory; we keep augmenting and aggregating them there (see the sketch below).
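
To make the application side concrete, here is a minimal sketch using the Python client library prometheus_client; the port, the metric, and the simulated traffic are illustrative assumptions, not prescriptions:

# A minimal sketch of the application side of the pull model, using the
# Python client library prometheus_client. The port, the metric, and the
# simulated traffic are illustrative assumptions.
import random
import time

from prometheus_client import Counter, start_http_server

REQUESTS = Counter("http_requests_total", "Requests made to public API")

if __name__ == "__main__":
    start_http_server(8000)  # serves GET /metrics on port 8000
    while True:
        REQUESTS.inc()               # the counter only grows in memory
        time.sleep(random.random())  # simulate incoming traffic

Prometheus then scrapes this endpoint on its own schedule and stores whatever value it finds there.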


The idea is that Prometheus collects point-in-time snapshots, and then we use its tools to calculate the changes.


That is, say we counted 10 HTTP requests in 30 seconds. The scraper came, and we gave it that number. What happens next?

Correct: we continue incrementing the same counter.

Incorrect: we reset the counter.

Yes, technically, you can reset the counter, but you don't want to do that! The application should accumulate an ever-growing (monotonically increasing) counter forever. This property will come in handy later.


Roughly speaking, Prometheus will take a derivative to find out "how many requests came in the last minute". By the way, derivatives are not a big deal. A detailed and clear explanation will be given later.

Security

By default, metrics are exposed without authentication. So the question arises: how do we close access to them so that outsiders can't see anything? This is especially important if the application is accessible from the outside, from the Internet. There are different options:


  • Configure authentication in Prometheus itself: it will send requests with the headers you need
  • Serve the /metrics endpoint on a separate port that is not exposed to the outside
  • Configure a firewall
  • Maintain a whitelist at the application level


Great, let's move on.

Alternative ways of scraping

Code doesn't always live as an application with an HTTP server. Sometimes it is a service without HTTP, some RabbitMQ queue handler, or simply a cronjob that starts on a timer, works, and dies.


The easy way to build metrics in these cases is to decide that the overhead of adding an HTTP server just for the sake of /metrics doesn't scare you. That's fine, but it won't help cronjobs that don't live as a persistent process and can't store and give metrics every 30 seconds. That's why there are options for arranging metrics collection to bypass the pull model. You'll have to bring up an auxiliary service of your choice:


  • Pushgateway. A Prometheus ecosystem component that lives as an intermediate application: you push your metrics to it, and Prometheus scrapes them from it (a sketch follows this list).
  • Telegraf. A universal converter and aggregator of metrics, which you will also have to keep running all the time. You can configure it to collect and receive metrics in any convenient way; it can filter and convert the data, and Prometheus will take the result from it.
  • StatsD Exporter. The application sends metrics to it in the statsd format, and it exposes them for Prometheus to scrape. The conceptual difference is only in the format. You will also have to keep it running all the time.
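
As an illustration of the Pushgateway option, here is a minimal sketch for a cronjob using the Python client prometheus_client; the gateway address, metric name, and job name are assumptions for the example:

# A minimal Pushgateway sketch for a cronjob, using prometheus_client.
# The gateway address, metric name, and job name are illustrative
# assumptions. Prometheus then scrapes the Pushgateway itself.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "cronjob_last_success_unixtime",
    "Last time the job finished successfully",
    registry=registry,
)

# ... do the actual work of the job here ...

last_success.set_to_current_time()
push_to_gateway("pushgateway:9091", job="nightly-cleanup", registry=registry)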


On the side of our code, we usually need to plug in and configure a library for collecting metrics. It will aggregate, format, and serve the page with metrics. In some tech stacks, the libraries integrate with your web server out of the box; in others, you'll have to wire things up yourself. The basic idea is that metrics libraries provide an API through which you register and describe a metric and then update it from anywhere in the application.


For example, increase the "number of requests to this endpoint" metric when receiving an HTTP request, and record the request processing time when sending the response. Now it's time to understand what metrics are from the Prometheus point of view and how they are updated from code. This way, we will understand which methods to use and which metrics are better suited for which tasks.


Prometheus: data format

The format in which the application writes metrics and serves them to Prometheus is simple enough and made to be easily read by eye. You don't need to count and format metrics manually - there are libraries for that. This is what the page the application serves at GET /metrics looks like:


# HELP http_requests_total Requests made to public API
# TYPE http_requests_total counter
http_requests_total{method="POST", url="/messages"} 1
http_requests_total{method="GET", url="/messages"} 3
http_requests_total{method="POST", url="/login"} 2


What's here:


  • HELP - description to help people
  • TYPE - type of metric
  • http_requests_total - name of the metric
  • set of key-value labels
  • metric value (64-bit float aka double)
  • at scrape time, the server adds a timestamp when writing the value to the database


Storage works like this: the name of a metric is actually a label named __name__. All labels together describe a time series, i.e., something like a table whose name is made up of all the key-value pairs. The series holds the values [(timestamp1, double1), (timestamp2, double2), ...]. In the example above, we have one metric, but three "tables" in the database: for GET /messages, POST /messages, and POST /login. Into each of them, every 30 seconds, goes the next number that the application reported at the moment of the scrape.


Doubles stored over time. No ints, no strings, no additional information. Just numbers!


By the way, it is worth looking up naming practices in the documentation. Labels are used for searching and aggregation, but there is one peculiarity: the server will suffer if you use unique or rarely repeated label values. This is because…

Cardinality

Each new label value is a new time series - that is, a new table. Therefore, you should not abuse them. A good label is limited in its possible values: it is bad to write the whole User-Agent there, but the browser name and major version are okay. User-Name is doubtful if there are hundreds of them, but OK if there are tens (API clients of an internal service, for example). A hedged illustration follows.
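
A sketch with the Python client (the metric, label, and values are hypothetical):

# Illustrative sketch: bounded vs. unbounded label values, using
# prometheus_client. The metric, label, and values are hypothetical.
from prometheus_client import Counter

REQUESTS = Counter("http_requests_total", "Requests made to public API",
                   ["browser"])

# Bad: every distinct User-Agent string would become a new time series.
# REQUESTS.labels(browser=raw_user_agent).inc()

# Better: normalize to a small, bounded set of values first.
REQUESTS.labels(browser="firefox-121").inc()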


For example, we write metrics about HTTP requests. Multiply all possible values of all labels: 2 HTTP verbs × 7 URLs × 5 service replicas × 3 response types (2xx, 3xx, 4xx) × 4 browsers = 840 time series! That's like 840 tables in SQL. Prometheus can handle tens of millions of rows, but a combinatorial explosion is easy to arrange. You can read more here: Cardinality is key.


In general, don't hesitate to write what you really need, but don't go overboard. Keep an eye on Prometheus's resource consumption and make sure that labels don't contain arbitrary text.


Before writing a metric, think about what form it will be displayed in. You don't need a graph with dozens of colorful lines dancing on it, so it is useless to write down the exact user-agent. You will still want to group it into something meaningful. On the other hand, the same metric can be grouped under different labels and drawn on different graphs. If you, for example, count HTTP requests and store method, client ID, and response code in labels, this metric can already be displayed in different ways: HTTP requests by client, HTTP requests by method and response code.

Types of metrics


Even though metrics have a TYPE field, there is no difference "under the hood". Like HELP, it exists just to make metrics easier for people to work with. However, the libraries we write metrics with are built around these types, and some query functions only work correctly for certain types. So you can think of a type as a convention for how the value of a metric behaves.


Further in the text, "API" stands for generic method names found in Prometheus client libraries for different languages, given purely for illustration. There are, of course, variants: for example, the .NET library App Metrics has slightly different names and methods, but the essence is the same.

Counter

The counter is a monotonically increasing number. It never decreases! It can be reset to zero, for example, during restarts of the service that writes metrics. This is important because Prometheus has special functions that take this into account. API: increase(), add(x).


How do we know how many requests there were per unit of time when we only have one number? Look at the delta: Prometheus saves snapshots of this number every 30 seconds. For example, if one scrape sees 100 and the next sees 110, that is 10 requests in 30 seconds. You would also need a workaround for when the application restarts and the counter suddenly drops to zero - but that is already taken into account in the functions that work with counters.
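
A minimal counter sketch with the Python client (the metric name and labels are illustrative assumptions):

# A sketch of a counter updated from request-handling code, using
# prometheus_client. The metric name and labels are illustrative
# assumptions.
from prometheus_client import Counter

HTTP_REQUESTS = Counter("http_requests_total", "Requests made to public API",
                        ["method", "url"])

def on_request(method: str, url: str) -> None:
    # Only ever increment: a process restart drops the value to zero,
    # and counter-aware server-side functions account for that.
    HTTP_REQUESTS.labels(method=method, url=url).inc()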

Gauge

A gauge is a number that can go up and down. API: setValue(x), increase(), decrease().


Since it isn't monotonic, some math tricks won't work, meaning it's a little more limited in its use. What kind of tricks? I will discuss this later.
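
A minimal gauge sketch with the Python client (the metric and hook names are illustrative assumptions):

# A sketch of a gauge tracking in-flight requests, using prometheus_client.
# The metric name and the hook functions are illustrative assumptions.
from prometheus_client import Gauge

IN_FLIGHT = Gauge("http_requests_in_flight",
                  "Requests currently being processed")

def on_request_start() -> None:
    IN_FLIGHT.inc()  # one more request in flight

def on_request_end() -> None:
    IN_FLIGHT.dec()  # unlike a counter, the value can go back down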

Histogram


A histogram is an aggregation performed by the application itself, used when we want to know the distribution of values across predefined groups (buckets). API: observe(x).


For example, we want to know the duration of HTTP requests. Let's define which times to consider good, which bad, and how much detail we want. Say, a qualitative distribution:


  • <= 0.1 sec - is a good request; I expect the majority of such requests
  • <= 1 - ok, but it would be better to know that they occur
  • <= 5 - suspicious, let's go and look at the code if there are a lot of such requests
  • more than 5 - outright bad; for the sake of uniformity, call it <= infinity


How it works: when a request arrives, we measure the processing time X and update the histogram - we add +1 to the matching buckets and add +X to the total time. Here are some examples of requests with different times hitting the buckets:


  • 0.01 will hit all buckets: <= 0.1, <= 1, <= 5, <= infinity
  • 0.3 will hit all buckets except the first one: <= 1, <= 5, <= infinity ; it will not hit the first one because the time is greater than 0.1
  • 4 will get into the buckets: <= 5, <= infinity ; it will not get into the first and second because the time is greater than 0.1 and 1
  • 10 will fall only into the bucket <= infinity; it will not get into the other buckets because the time is greater than 0.1, 1, and 5
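
Defining the bucket layout above with the Python client might look like this (a sketch; the metric name is an assumption, and the library adds the +Inf bucket automatically):

# A sketch of the bucket layout above, using prometheus_client.
# The metric name is an illustrative assumption; the +Inf bucket
# is added by the library automatically.
from prometheus_client import Histogram

REQUEST_TIME = Histogram(
    "http_request_duration_seconds",
    "Time spent processing a request",
    buckets=(0.1, 1, 5),
)

def handle_request() -> None:
    with REQUEST_TIME.time():  # observe() is called with the elapsed time
        ...                    # actual request processing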


The histogram counts the number of hits in some group, i.e., it memorizes the counters, not the values themselves! We are, after all, limited by the fact that the metric itself is only one number. Each bucket is like a separate metric.


How do we use it? You can simply plot the desired bucket divided by the total count: that gives the share of "good" or "bad" requests in the total mass, depending on what we want to observe. But it is better not to do this by hand and instead aggregate buckets into quantiles with a single function (histogram_quantile() in PromQL). It is convenient and simple and is calculated on the Prometheus server, though with a loss of accuracy (the fewer the buckets, the lower the accuracy). If you want to calculate quantiles yourself, or you don't know in advance what buckets you need, there is another type: Summary.

Summary


Summary - get ready, this one is complicated. At first glance, it looks like a histogram, but it is the result of histogram-style aggregation: it gives out ready-made quantiles when we can't determine the bucket list in advance. API: observe(x).


It's easiest to explain in practice. We usually don't know in advance what to consider a good request time and what a bad one. So we just throw the measured time into the Summary and then see where 95% of requests fit. And 50%, and 99% too. So, a request comes in, we measure the processing time X, and put it into the Summary:


  • +1 to the request count
  • X time itself is stored in a set of values in the application memory
  • Recalculate the quantiles
  • Periodically, old values have to be dumped from memory so that it doesn't grow endlessly
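
A hedged summary sketch with the Python client. One assumption worth flagging: the Python client's Summary exports only the count and the sum, without quantiles; client-side quantiles are available in some other clients, such as Go and Java.

# A sketch of a summary, using prometheus_client. The metric name is an
# illustrative assumption. Note: this client's Summary exports only
# _count and _sum; quantiles are computed client-side in some other
# clients (e.g., Go and Java).
from prometheus_client import Summary

LATENCY = Summary("http_request_duration_seconds",
                  "Request processing time in seconds")

def handle_request() -> None:
    with LATENCY.time():  # observe(x) with the elapsed time on exit
        ...               # actual request processing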


You can't naively aggregate summaries from several instances; strictly speaking you can, if you are careful, but with a loss of accuracy. Summaries also weigh on the application's memory, because you must keep a set of values over some period of time. Because of this, summaries compute quantiles lossily: old data is gradually superseded, so it has less impact on the value you currently get.


There are different approaches here, such as a sliding window (throwing out the oldest values) or throwing out random ones. It depends on what you want the metric to show: statistics on all requests in general or only on recent ones.

Conclusion

I think this information will be enough for you to get started. We will continue discussing metrics in DevOps in the next article.