
“Trust But Verify” Your Metrics

by Mahesh Paolini-Subramanya, September 24th, 2018


Let’s take for granted that you’ve done the right thing — you’ve generously instrumented your system, and are actually paying attention to the metrics that you’ve generated (•). The question on the table is — “Do you actually trust the metrics that you are generating?” (Hint: You shouldn’t)

Let’s look at something fairly straightforward: the request/response path shown below.

You would think that the ResponseTime would be the sum of each of the processing stages, right? i.e., ResponseTime = 10 + 1 + 20 + 1 + 5 = 37ms?

But, since you shouldn’t trust your metrics, you

  1. Also measured ResponseTime directly,
  2. Compared it against what it should be, and
  3. Charted/alerted on deviations

…and you found that the actual ResponseTime was, say, 52ms.
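The three steps above can be sketched as follows. The stage names and timings are purely illustrative (taken from the 10 + 1 + 20 + 1 + 5 example), and the alert threshold is an assumption, not something from any particular monitoring system:

```python
# Sketch: verify that per-stage metrics add up to the directly measured
# response time, and flag deviations. All names/numbers are illustrative.

STAGE_TIMINGS_MS = {
    "receive":   10,
    "validate":   1,
    "analyze":   20,
    "serialize":  1,
    "respond":    5,
}

TOLERANCE_MS = 5  # hypothetical: how much deviation we tolerate before alerting


def check_response_time(measured_ms: float) -> bool:
    """Compare directly measured response time against the sum of stages."""
    expected_ms = sum(STAGE_TIMINGS_MS.values())  # 37ms in this example
    deviation = measured_ms - expected_ms
    if abs(deviation) > TOLERANCE_MS:
        # In a real system this would fire an alert, not just print
        print(f"ALERT: measured {measured_ms}ms vs expected {expected_ms}ms "
              f"(deviation {deviation:+.0f}ms)")
        return False
    return True


check_response_time(52)  # deviates by +15ms, so this trips the alert
```

The point of the comparison is not the exact threshold, but that the expected value is derived independently from the per-stage metrics rather than trusted on its own.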

That’s quite a difference, no? As to why it was 52ms, let’s look at a bunch of possible issues:

  1. Measuring the Wrong Thing: You actually instrumented something completely different. I know, that sounds goofy, but it happens all the time, e.g., you’re measuring the validate_user interval instead of validate_users (spelling issues with APIs. yay.)
  2. Incomplete Instrumentation: Simply put, you missed something, e.g., there’s a queue in front of the Analyze component that you haven’t instrumented, so you’re not measuring the latency there.
  3. System Issues: Oops. A garbage-collection pause. Or a failover. Or a restart. Or whatever.
  4. Unexpected Code Paths: Your code has a bunch of paths in it to deal with edge-cases (e.g., “strip semi-colons from the input”), and some of these trigger additional steps that you had forgotten about.
  5. Time Issues: You just plain screwed up by making one of the — infinitely many — wrong assumptions about time, such as that it increases monotonically, or that everything is GMT, or whatever.
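That last pitfall is worth a concrete illustration. Wall-clock time (Python’s `time.time()`) can jump backwards or forwards under NTP corrections or manual clock changes, so durations computed from it can come out wrong or even negative; `time.monotonic()` is guaranteed never to go backwards. A minimal sketch:

```python
import time


def timed_call(fn):
    """Time a call using the monotonic clock, which never goes backwards.

    Using time.time() here instead would be one of those "infinitely many"
    bad assumptions about time: an NTP adjustment mid-call could make the
    measured duration wrong, or even negative.
    """
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start
    return result, elapsed


result, elapsed = timed_call(lambda: sum(range(1000)))
# elapsed >= 0 is guaranteed for monotonic; it is NOT guaranteed for time.time()
```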

And this is just when it comes to measuring time. The point here is that you should be validating all your metrics through multiple means. In fact, if you are already doing this and all the numbers line up, you should be very, very worried — you’re probably missing something!

So yeah, trust your metrics, after you’ve verified them…




(•) You’d be surprised how often I see this missed. “Did you instrument your code?” — “D-uh, of course!” “Grafana?” — “Dude, come on, what d’you think I am?” “When was the last time you looked at it?” — “Uhhhhh”

(This article also appears on my blog)