In this article, I am going to shed light on one important aspect of observability - tracing.
While metrics and logs are fairly clear terms that we covered in the previous articles, tracing is a newer, more modern term and probably a bit harder to grasp.
Actually, the term itself is not hard to understand, as I am going to show you. On the other hand, it is complicated to implement tracing or to introduce it into an already running system (assuming, of course, your system is more complex than two simple virtual machines).
strace is a tool for tracking which system calls a process makes; it shows the time taken by each syscall, its arguments, and return values. Many sysadmins and developers are familiar with this tool because it is very useful in everyday work.
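For example, assuming you are on a Linux box with strace installed, the following attaches to a running process and prints every syscall together with the time spent in it (the -T flag):

$ strace -T -p <pid>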
Tracing in observability is similar to strace, but a bit more complicated: strace gives you visibility inside a single application (a process), while tracing is used across distributed systems. That is the difference.
Traces usually look like logs: these data structures contain the time spent in each unit of work, the unique ID of the trace, and so on.
Let’s consider a simple example: a client-server system. One side sends requests (the client) and the other replies over the network (the server). It does not matter which protocol is used for communication between the client and the server; it may be a binary protocol, HTTP, etc.
This is shown in pic 1 below.
You can build such a system by launching the following on a server-side:
$ nc -l 10000
and on a client:
$ nc <server> 10000
What happens here?
The client sends a request over the network; the server receives it, processes it (first of all, it parses the request), then prepares a reply and sends it back. This is a very straightforward process.
But processing requests on the server side takes time; if the number of connected clients grows significantly, the processing time grows as well. In that case, a client might not receive anything and close the connection on timeout. Such situations do occur. This is the typical client-server architecture of any modern application.
As DevOps professionals, we should track such behaviors and take preventive/corrective measures. We should understand the architecture and where a bottleneck may occur.
Let’s have a look at picture 2 below.
At the moment t0 the server receives a request, and at the moment t1 a reply is sent to the client. We don’t take network lag and similar factors into consideration. The interval t1 - t0 is the time spent on processing. In picture 1, we see only one simple scheme. Of course, a modern distributed or microservices architecture has many more elements; this is true for any complex system, not just microservice applications.
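Before moving on, here is a minimal sketch in Python of such a server that measures its own t1 - t0 for every request. The port matches the nc example above, and the "processing" step is a stand-in for real work:

```python
import socket
import time

# A toy server for the scheme in pic. 1/2: it receives a request,
# "processes" it, replies, and logs its own processing time (t1 - t0).
srv = socket.create_server(("0.0.0.0", 10000))
while True:
    conn, addr = srv.accept()
    with conn:
        data = conn.recv(4096)
        t0 = time.monotonic()        # t0: the request has been received
        reply = data.upper()         # stand-in for real processing
        conn.sendall(reply)
        t1 = time.monotonic()        # t1: the reply has been sent
        print(f"request from {addr} processed in {(t1 - t0) * 1000:.3f} ms")
```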
So far we have considered a pretty simple example with only one server and one client. Modern systems contain many different components. For clarity, let’s take a simple web application. It contains a frontend (a web server that serves static files and passes requests to an interpreter like Python or PHP), a backend, and at least one database, let’s say PostgreSQL.
This system is presented in pic 3 below.
What do we have here?
We see three components, and each of them adds latency to the processing of the client’s request.
I suggest redrawing the picture to get a better understanding of the request.
At the moment:
t0 - the frontend receives the client’s request
t1 - the frontend sends a request to the backend
t2 - the backend receives the request
t3 - the backend sends a request (an SQL query) to the database
t4 - the database receives the query
t5 - the database returns a reply to the backend
t6 - the backend receives the reply from the database
t7 - the backend returns a response to the frontend
t8 - the frontend receives the response from the backend
t9 - the frontend returns the response to the client
In other words, the intervals:
t0 - t1 - client request processing time in the frontend
t2 - t3 - client request processing time in the backend
t4 - t5 - query processing time in the database management system
t6 - t7 - time the backend spends processing the database’s reply
t8 - t9 - time the frontend spends processing the backend’s response
or:
t2 - t7 - time the backend spends processing the frontend’s request, including the query to the DBMS (database management system), let’s say 200 ms
t0 - t9 - total time the system spends handling the client’s request, for example, 500 ms
That is, the remaining 300 ms is spent outside the backend: in the frontend itself and on the hops between the components.
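A tracing system essentially does this arithmetic for you. As a rough sketch with the made-up numbers from above (the individual timestamp values are invented for illustration):

```python
# Invented timestamps in milliseconds, relative to t0, matching the example above.
t0, t2, t7, t9 = 0, 150, 350, 500

backend_total = t7 - t2                        # 200 ms: backend work, including the DB query
end_to_end = t9 - t0                           # 500 ms: what the client actually experiences
outside_backend = end_to_end - backend_total   # 300 ms: frontend work plus the hops in between
print(backend_total, end_to_end, outside_backend)
```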
The frontend can be a web server plus an interpreter communicating via CGI or any other interface, so we can break the frontend down into two parts, and the same applies to the backend. How small a part should be depends on your requirements; there is no universal advice.
The parts into which the system can be divided to measure processing time are, in tracing terms, called spans. A span is a unit of data processing. A set of spans forms a trace. A trace can be represented as a directed acyclic graph (DAG) and may have forks.
Thus, there are 3 spans in pic. 4:
1-1’ - root span
2-2’ - child of span 1-1’
3-3’ - child of span 2-2’, which represents the work done inside the database
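For illustration, the three spans from pic. 4 could be recorded roughly like this (the IDs, names, and durations are invented for the example and don’t follow any particular tool’s format):

```python
# A sketch of the trace from pic. 4: a set of spans sharing one trace ID,
# linked through parent_span_id into a DAG. All values are invented.
trace = [
    {"trace_id": "abc123", "span_id": "1", "parent_span_id": None,  # span 1-1' (root)
     "name": "frontend: handle client request", "duration_ms": 500},
    {"trace_id": "abc123", "span_id": "2", "parent_span_id": "1",   # span 2-2'
     "name": "backend: handle frontend request", "duration_ms": 200},
    {"trace_id": "abc123", "span_id": "3", "parent_span_id": "2",   # span 3-3'
     "name": "database: execute SQL query", "duration_ms": 80},
]
```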
If you don’t have the necessary telemetry data, that is, your application does not produce it and your developers don’t know how to implement it (I understand it’s an additional load for developers: they have to dig deep into many things, implement their own protocol, etc.), there are open-source solutions that can be embedded in your code so that it generates the necessary information.

There are two different open-source standards:
OpenTelemetry (https://opentelemetry.io)
OpenTracing (https://opentracing.io/)

OpenTracing has recently been archived, so we are left with only one - OpenTelemetry (or OTel for short).
Here is a list of popular open-source solutions for tracing:
| Name | Standard | Language | Homepage |
|---|---|---|---|
| Jaeger | OTel | Golang | |
| Zipkin | OTel | Java | |
| SigNoz | OTel | Golang | |
| Sentry | Own implementation | Python | |
| Elastic APM | OTel | Java | https://www.elastic.co/observability/application-performance-monitoring |
Whether you have the application’s source code (even if you can’t rewrite it for some reason) or you use a third-party application whose sources you don’t have (a black box), if you want to make the system observable, it must be instrumented: that is, the application must emit specific information.
To collect telemetry traces, there are two approaches:
You use automatic instrumentation (aka the service mesh approach). This is applied when you don’t have the sources, the application is a black box and you don’t know its inner implementation, or the application is already developed. For this purpose, OTel offers agents or special extensions (depending on the language).
You modify your code and add specific OTel API calls. This approach means adding calls in your code that generate the tracing output; a small sketch of what this looks like is shown below.
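As a minimal sketch of the second approach (assuming the opentelemetry-sdk Python package is installed), here is what manual instrumentation could look like. The span names and the attribute are made up, and in a real setup you would export to a collector rather than the console:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Set up a tracer that prints finished spans to stdout (good enough for a demo).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def handle_request(payload: bytes) -> bytes:
    # The outer span covers the whole request; the inner one covers the DB call.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("request.size", len(payload))
        with tracer.start_as_current_span("db_query"):
            result = payload.upper()   # stand-in for the real SQL query
    return result
```

For the first, automatic approach, OTel ships language-specific agents; in Python, for example, the opentelemetry-instrument wrapper can instrument popular libraries without any code changes.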
What else do you need to build a system for tracking traces?
At the very least, you need storage for the telemetry information you receive from the application(s), and a visualization application that will show you the traces. As shown above, there are many ready-to-use open-source applications that include both storage and a nice graphical user interface. Of course, using OTel you can build your own system or reuse one you already have, but it’s not a trivial task.
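If you just want to experiment, a common shortcut is Jaeger’s all-in-one Docker image, which bundles a collector, in-memory storage, and the web UI in a single container (check the Jaeger documentation for the exact ports and options for your version):

$ docker run --rm -p 16686:16686 -p 4317:4317 -p 4318:4318 jaegertracing/all-in-one

The UI then becomes available at http://localhost:16686, and applications can send OTLP data to port 4317 (gRPC) or 4318 (HTTP).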
When you start your DevOps career, you have lots of questions, and sometimes it’s difficult to figure out how to bring new technologies into the infrastructure and how to improve it. There is a lot of information, and it’s pretty easy to get confused.
In this article, we have considered one of the important concepts - tracing in observability. As shown, tracing is an approach for gaining visibility into your distributed system. Nowadays, many organizations have a microservices architecture, and tracing allows you to identify performance bottlenecks in distributed systems and to understand strange behavior. It is useful in troubleshooting too.
At the very least, tracing facilitates your work and gives you more flexibility, reducing the time spent searching for issues in large systems. It becomes crucial in modern IT infrastructures, where monitoring alone is not enough.
I hope this article is useful and will help you understand the concepts of tracing.
Lead image generated with stable diffusion.