Caveats in metric collection

I want to discuss different ways of collecting metrics from distributed (microservice) systems.

The two most common ways I’ve seen are structured logging and metric emitting. Let’s dig in.

Structured logging

The idea of structured logging is to produce log messages that have a predefined format, e.g. JSON, and to put some semantics (structure) in the JSON fields of each message.

{"name":"ServiceA","hostname":"5a28d77e1096","pid":1,"level":"info","user":"ivan","msg":"Something is happening.","time":"2017-04-21T07:57:44.712Z"}

Having structured logging in place enables constructing log-based metrics. Log-based metrics allow you to count the number of log entries that match given user defined filters. Example would be all logs that have level field with error value. And with systems like ELK or Stackdriver this is very easy. In fact it is so easy, that people abuse it. Why? Here are several reasons.

Writing tons of data to stdout/stderr has a performance impact on your application.

For small amount of data it may be negligible, but as the data volume increases so does the performance penalty. Also be aware that some log collecting agents might make use of throttling, which can slow down your application even more. Just for reference, haproxy does not support logging to stdin/stderr or to a log file and I’m pretty sure this is one of the reasons for the decision. If you still decide to do it, make sure it is not in the thread handling the hot path of your application.

Writing tons of logs could impact the performance of your logging stack.

Let’s consider the ELK stack. In most cases I’ve seen, ELK has a hidden Q in it. The Q stands for a queue which buffers messages if sudden spikes in traffic arise or the parsing or the storage layers are experiencing turbulence.

And it is not uncommon for the last to happen. After all, this is another distributed system. But for you as a end users, this means that your view for the state of the system is lagging behind. And if there is problem with your services and they produce even more log messages in order to signal for the problem — it’s getting even worse.

Writing tons of logs could impact the underlying system’s network

Transporting a log message requires much more bandwidth than transporting a single metric value. I’ve seen a situation where a performance test for the logging subsystem of a PaaS solution almost bring down whole network segment of the underlying IaaS provider. Not to mention that some of the monitoring tools will accumulate all value changes locally, and only periodically send the delta, which will reduce both the consumed network bandwidth the performance penalty even more. This technique is used for tracing as well [2].

Worse development experience

Not so technical, but very important thing. Having a local development environment that consist of 5+ services, each going crazy on logging is a terrible experience. Even more, you may benefit from the log-based metrics in your staging and/or production environments, but bringing them to development is a hard task. Emitting metrics for things you care solves both problems — you’re able to focus on the important log messages and still, observe the behavior of the system.

Some systems might drop log messages

If you rely on log-based metrics to capture all of those important events, and they become more and more, you might loose some without even getting notified (e.g. Cloud Foundry, [1]). So one more time — log only important and actionable events and use metrics for the other things.

Emitting metrics

Things are more diverse here in terms of technologies, but conceptually you’re capturing your application state and emitting it to a 3rd party system at some point. Following are some points for emitting metrics.

Accumulate changes locally

Since at the end metrics are just numbers, it is possible to accumulate them locally and emit them once in a while. It is possible to lose the accumulated information in case of crash, but this greatly reduces the consumed network bandwidth. It is also possible to do it the other way around — the application exposes the current values and whoever is interested can read them. (This is basically push vs. pull models)

Intensions are explicit in the source code

Collecting metrics requires instrumenting your code. It is a bit of work, but it makes your intention clear, which is not the case with log messages. If someone accidentally changes the format of log message you’re relying on for your log-based metrics, you’ll notice just after all of your running service instances are updated. Having separate code in place for collecting and emitting metric greatly reduces the chance somebody deletes it by accident.

Tracing

Tracing is a hot topic (at least for me) recently, and it surely has a place in making your system more observable. The idea is to trace a request across service boundaries. It is particularly useful when the request is interacting with multiple services. I will not cover any more, but just wanted to mention it. For more details, you can start with [2].

No silver bullet

Both logging and metric emitting have their place. Just make sure next time you go for a particular option you understand the pros and cons and are aware of all alternatives.

[1] https://docs.cloudfoundry.org/devguide/deploy-apps/streaming-logs.html

[2] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36356.pdf