In this article, I am going to discuss and clarify the difference between observability and monitoring in modern computer systems. At the beginning of the computer era, computers were giant, standalone machines with no networks; they were used occasionally for scientific computations and operated by a single person. Over time, computers became distributed and connected over networks. These days we have personal computers and servers, and all of them are connected to the network. Since a server is supposed to run independently and without interruption, someone has to check its availability from time to time.
This is especially important for servers because we all want to buy something on the internet, read the news, and so on, and not see 500 errors in the browser. When you have one, two, or three servers, it is not a problem to check them manually. To assess the state of a server, we check parameters such as load average, memory usage, and so on. For this purpose, there are tools like top, atop, htop, iftop, and others. All of these tools read the /proc filesystem to fetch the necessary information about the server's state.
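As a minimal illustration of what these tools do under the hood, here is a short Python sketch (an assumption for this article, not code taken from any of the tools above) that reads the load average and memory figures straight from /proc:

```python
# Minimal sketch: read basic health figures straight from /proc,
# the same source that top/htop/atop rely on. Linux only.

def read_loadavg():
    # /proc/loadavg looks like: "0.42 0.35 0.30 1/789 12345"
    with open("/proc/loadavg") as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)

def read_meminfo():
    # /proc/meminfo lines look like: "MemTotal:  16318364 kB"
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])  # value in kB
    return info

if __name__ == "__main__":
    load1, load5, load15 = read_loadavg()
    mem = read_meminfo()
    print(f"load average: {load1} {load5} {load15}")
    print(f"memory used: {(mem['MemTotal'] - mem['MemAvailable']) // 1024} MB "
          f"of {mem['MemTotal'] // 1024} MB")
```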
Pic 1. An example of watching the server's state with top
You can use a few displays to track this information in real time. Sometimes you also want to look into the past and compare how a server behaved before with how it behaves now. At that point, computers require not only operators but also a dedicated person: an admin. The admin then decides to store the data in files or a database and to automate the process. What happens next? We arrive at the idea of monitoring.
All the fetched parameters can be stored in a database as values; we call them metrics. Once metrics are in a database, we can build graphs and look into the past. Administrators want to fetch and store metrics automatically, so monitoring systems appeared in our lives, followed by monitoring agents: first Nagios, New Relic, and Zabbix, and later Prometheus. How does this work? An agent is installed on a host, and the agent sends data to a server where the database runs (or vice versa, the server pulls metrics from the monitored host). To avoid staring at a fleet of servers on a display all day, a new term appears, triggers: we set a threshold for a value, and when the trigger fires, we see an alert. The next step is a notification system. People who maintain servers want to deal with problems only when they actually occur: a problem arises, and the monitoring system notifies the people involved. Messages can go through e-mail, messengers like Slack or Telegram, SMS, etc. The entire pipeline looks like the following:
Pic 2. Internals of a monitoring system
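To make the agent, server, trigger, and alert chain from the picture concrete, here is a hedged Python sketch. The names (push_metric, check_trigger, the URL) are illustrative assumptions made up for this article, not the API of any real monitoring system:

```python
import json
import time
import urllib.request

# Hypothetical agent side: collect a metric and push it to a monitoring server.
MONITORING_URL = "http://monitoring.example.com/api/metrics"  # assumed endpoint

def push_metric(name, value):
    payload = json.dumps({"host": "web-01", "metric": name,
                          "value": value, "ts": time.time()}).encode()
    req = urllib.request.Request(MONITORING_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)  # a real agent adds retries, timeouts, batching

# Hypothetical server side: a trigger is just a threshold check over stored values.
def check_trigger(metric_name, value, threshold):
    if value > threshold:
        notify(f"ALERT: {metric_name}={value} exceeded threshold {threshold}")

def notify(message):
    # In a real system this goes to e-mail, Slack, Telegram, SMS, ...
    print(message)

if __name__ == "__main__":
    # A real agent would call push_metric(...) on a schedule; here we only
    # demonstrate the trigger logic on the current 1-minute load average.
    load1 = float(open("/proc/loadavg").read().split()[0])
    check_trigger("load_average_1m", load1, threshold=4.0)
```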
That's nice because we can foresee behavior, load, and memory consumption. Monitoring has been the basic way to track the health of servers for years. Which metrics can we fetch from a server? Load average, memory usage, disk space, network traffic, and similar system-level values.
Some of these metrics can be extended with scripts (bash, Python, etc.) if we are talking about non-standard metrics of a custom application, as in the sketch below. In other words, a metric is anything that can be measured.
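As an example of such a script (the metric and the log path are assumptions made up for this illustration), a custom metric can be as simple as a small program that prints a single number for the monitoring agent to pick up:

```python
#!/usr/bin/env python3
# Hypothetical custom metric: count "ERROR" lines in an application log.
# A monitoring agent (e.g. a Zabbix user parameter or a cron job that pushes
# the value) would run this script and store whatever number it prints.

import sys

LOG_PATH = "/var/log/myapp/app.log"   # assumed path for illustration

def count_errors(path):
    try:
        with open(path) as f:
            return sum(1 for line in f if "ERROR" in line)
    except FileNotFoundError:
        return 0

if __name__ == "__main__":
    print(count_errors(LOG_PATH))
    sys.exit(0)
```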
At this point, we can watch hundreds of servers, create graphs, look into the past, and even predict the future (yes! there are special functions that predict a value or estimate when a value will exceed a threshold). To see the entire picture, there are dashboards uniting graphs, triggers, reports, and so on. We use monitoring to keep a server in a desired state (enough free memory so that processes are not killed; high enough CPU utilization so that the machine is not idling). But it does not cover the application.
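As a rough illustration of such prediction functions (this is a plain least-squares sketch, not the actual implementation behind built-ins like Prometheus's predict_linear() or Zabbix's forecast()), fitting a line through recent samples is enough to estimate when a value will cross a threshold:

```python
# Sketch: estimate when a growing metric (e.g. disk usage %) will cross a
# threshold by fitting a straight line through recent samples.

def time_until_threshold(timestamps, values, threshold):
    n = len(timestamps)
    mean_t = sum(timestamps) / n
    mean_v = sum(values) / n
    # Least-squares fit: v ~ slope * t + intercept
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(timestamps, values)) \
            / sum((t - mean_t) ** 2 for t in timestamps)
    intercept = mean_v - slope * mean_t
    if slope <= 0:
        return None  # not growing; threshold will not be reached at this rate
    t_cross = (threshold - intercept) / slope
    return t_cross - timestamps[-1]  # seconds from the last sample

# Example: disk usage sampled every hour, predicting when it reaches 90%.
ts = [0, 3600, 7200, 10800]
disk_pct = [70.0, 72.5, 75.0, 77.5]
print(time_until_threshold(ts, disk_pct, 90.0))  # ~18000 s, i.e. about 5 hours
```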
Now the development of computer systems and applications demands another approach. In business terms, we want to understand how healthy the applications that earn money for us are. Just imagine that our company runs a set of applications: one processes images of the Earth's atmosphere, others solve equations and send weather forecasts to farmers, and of course these forecasts must be ready by a specific time; there is no point in sending a forecast after the moment it was made for. This is a complex system, and understanding where it failed is the most important thing for us. As we discussed above, we can fetch server health metrics, but we do not observe the application and its behavior. The application is running, we can see its process in the OS, but we cannot see how efficient it is. We want to get information about how the application works and notify a responsible person if some metric goes out of bounds. This is the idea of observability: server health monitoring plus application metrics.
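One common way to obtain such application-level metrics is to have the application expose them itself. Here is a hedged Python sketch using the prometheus_client library; the metric names, the port, and the forecast workload are assumptions invented for this example:

```python
# Sketch: an application exposes its own metrics instead of relying only on
# OS-level figures. Requires the prometheus_client package.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

FORECASTS_SENT = Counter("forecasts_sent_total",
                         "Number of weather forecasts sent to farmers")
FORECAST_DURATION = Histogram("forecast_duration_seconds",
                              "Time spent preparing one forecast")

def prepare_and_send_forecast():
    with FORECAST_DURATION.time():            # measure how long the work takes
        time.sleep(random.uniform(0.1, 0.5))  # placeholder for real computation
    FORECASTS_SENT.inc()                      # count successful forecasts

if __name__ == "__main__":
    start_http_server(8000)   # metrics become scrapeable at http://localhost:8000/
    while True:
        prepare_and_send_forecast()
        time.sleep(1)
```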
Pic 3. Observability is a set of Monitoring, Logging, and Tracing
There are several components needed to observe a system. The first thing to include when building an observable system is log information (aka events). Logs contain a huge amount of information about application health, efficiency, and so on. Logs are the timestamped text output of an application, and the timestamp gives a convenient way to aggregate them (though it is not a strict requirement). Consider a system that sends SMS to end users: the application can fail because the mobile network operator's API is unreachable, or because it sends SMS too often, and so on. This happens from time to time; it is not a constant problem but a hidden one that depends on various conditions. There may also be bugs in the application itself that we did not catch at the testing stage. There are many reasons to fail, and we can find out which one applies by reading the logs. These two things, metrics and logs, come together under observability, but they are not everything that is usually included in it. As discussed above, besides the application that sends SMS to end users, other applications fetch data from satellites, process it, solve differential equations, ingest the results into a database, and so on. Many components work together as the backend that prepares the forecasts. To identify where a problem arose, we need to understand the architecture of the application. Following how data flows from satellite images to the final forecast is tracing, the third unit of observability. It does not matter whether your infrastructure is based on microservices or built some other way.
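To make the logging and tracing parts more tangible, here is a small Python sketch of timestamped, structured log lines that carry a trace ID through two stages of a pipeline. The stage names and fields are illustrative assumptions; real systems would typically use dedicated tooling (for example an OpenTelemetry SDK) instead:

```python
import json
import logging
import time
import uuid

# Sketch: structured, timestamped logs that carry a trace_id, so events from
# different components of the same request can be correlated later.
logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("forecast-pipeline")

def log_event(trace_id, stage, message, **fields):
    log.info(json.dumps({
        "ts": time.time(),       # timestamp: the key used to aggregate events
        "trace_id": trace_id,    # the same id follows the request end to end
        "stage": stage,
        "message": message,
        **fields,
    }))

def process_satellite_image(trace_id):
    log_event(trace_id, "ingest", "satellite image received", size_mb=42)

def send_forecast(trace_id):
    log_event(trace_id, "delivery", "forecast sent", channel="sms")

if __name__ == "__main__":
    trace_id = uuid.uuid4().hex   # generated at the entry point of the request
    process_satellite_image(trace_id)
    send_forecast(trace_id)
```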
Understanding this helps us find performance bottlenecks and problems in distributed applications and systems, especially in cloud-native ones. Observability is also very important in DevOps if we build CI/CD pipelines. Now we have an answer to the question of what observability is and how it differs from monitoring. In other words, observability is the combination of monitoring, tracing, and logging: an observable system has metrics, logs that can be aggregated in dedicated tools, and traces. Monitoring complements observability, so an observable system cannot be built without monitoring.
Further aspects of observability will be covered in the next articles.