
“Trust But Verify” Your Metrics

by Mahesh Paolini-Subramanya, September 24th, 2018


Let’s take for granted that you’ve done the right thing — you’ve generously instrumented your system, and are actually paying attention to the metrics that you’ve generated (•). The question on the table is — “Do you actually trust the metrics that you are generating?” (Hint: You shouldn’t)

Let’s look at something fairly straightforward: the request/response path shown below.

You would think that the ResponseTime would be the sum of each of the processing stages, right? i.e., ResponseTime = 10 + 1 + 20 + 1 + 5 = 37ms?

But, since you shouldn’t trust your metrics, you

  1. Also measured ResponseTime directly,
  2. Compared it against what it should be, and
  3. Charted/alerted on deviations

…and you found that the actual ResponseTime was, say, 52ms.
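The three steps above can be sketched as follows. The stage names and timings are purely illustrative (taken from the 10 + 1 + 20 + 1 + 5 example), and the alert threshold is an assumption, not something from any particular monitoring system:

```python
# Sketch: verify that per-stage metrics add up to the directly measured
# response time, and flag deviations. All names/numbers are illustrative.

STAGE_TIMINGS_MS = {
    "receive":   10,
    "validate":   1,
    "analyze":   20,
    "serialize":  1,
    "respond":    5,
}

TOLERANCE_MS = 5  # hypothetical: how much deviation we tolerate before alerting


def check_response_time(measured_ms: float) -> bool:
    """Compare directly measured response time against the sum of stages."""
    expected_ms = sum(STAGE_TIMINGS_MS.values())  # 37ms in this example
    deviation = measured_ms - expected_ms
    if abs(deviation) > TOLERANCE_MS:
        # In a real system this would fire an alert, not just print
        print(f"ALERT: measured {measured_ms}ms vs expected {expected_ms}ms "
              f"(deviation {deviation:+.0f}ms)")
        return False
    return True


check_response_time(52)  # deviates by +15ms, so this trips the alert
```

The point of the comparison is not the exact threshold, but that the expected value is derived independently from the per-stage metrics rather than trusted on its own.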

That’s quite a difference, no? As to why it was 52ms, let’s look at a bunch of possible issues:

  1. Measuring the Wrong Thing: You actually instrumented something completely different. I know, that sounds goofy, but it happens all the time, e.g., you’re measuring the validate_user interval instead of validate_users (spelling issues with APIs. yay.)
  2. Incomplete Instrumentation: Simply put, you missed something, e.g., there’s a queue in front of the Analyze component that you haven’t instrumented, so you’re not measuring the latency there.
  3. System Issues: Oops. A garbage-collection pause. Or a failover. Or a restart. Or whatever.
  4. Unexpected Code Paths: Your code has a bunch of paths in it to deal with edge-cases (e.g., “strip semi-colons from the input”), and some of these trigger additional steps that you had forgotten about.
  5. Time Issues: You just plain screwed up by making one of the — infinitely many — wrong assumptions about time, such as that it increases monotonically, or that everything is GMT, or whatever.
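That last pitfall is worth a concrete illustration. Wall-clock time (Python’s `time.time()`) can jump backwards or forwards under NTP corrections or manual clock changes, so durations computed from it can come out wrong or even negative; `time.monotonic()` is guaranteed never to go backwards. A minimal sketch:

```python
import time


def timed_call(fn):
    """Time a call using the monotonic clock, which never goes backwards.

    Using time.time() here instead would be one of those "infinitely many"
    bad assumptions about time: an NTP adjustment mid-call could make the
    measured duration wrong, or even negative.
    """
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start
    return result, elapsed


result, elapsed = timed_call(lambda: sum(range(1000)))
# elapsed >= 0 is guaranteed for monotonic; it is NOT guaranteed for time.time()
```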

And this is just when it comes to measuring time. The point here is that you should be validating all your metrics through multiple means. In fact, if you are already doing this and all the numbers line up, you should be very, very worried — you’re probably missing something!

So yeah, trust your metrics, after you’ve verified them…




(•) You’d be surprised how often I see this missed. “Did you instrument your code?” — “D-uh, of course!” “Grafana?” — “Dude, come on, what d’you think I am?” “When was the last time you looked at it?” — “Uhhhhh”

(This article also appears on my blog)