Let’s take for granted that you’ve done the right thing — you’ve generously instrumented your system, and are actually paying attention to the metrics that you’ve generated (•). The question on the table is — “Do you actually trust the metrics that you are generating?” (Hint: You shouldn’t)
Let’s look at something fairly straightforward: the request/response path sketched below.
You would think that the ResponseTime would be the sum of the individual processing stages, right? i.e., ResponseTime = 10 + 1 + 20 + 1 + 5 = 37ms.
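To make that concrete, here’s a minimal sketch of what per-stage instrumentation might look like. Everything in it is hypothetical: the stage names (parse_request, validate_users, process, serialize, send_response) and their latencies are stand-ins chosen to reproduce the 10 + 1 + 20 + 1 + 5 arithmetic.

```python
import time

# Hypothetical stage names and latencies, standing in for a real pipeline.
STAGES = [
    ("parse_request",  0.010),  # ~10ms
    ("validate_users", 0.001),  # ~1ms
    ("process",        0.020),  # ~20ms
    ("serialize",      0.001),  # ~1ms
    ("send_response",  0.005),  # ~5ms
]

def handle_request():
    """Time each stage individually, the way per-stage instrumentation would."""
    timings = {}
    for name, work in STAGES:
        start = time.monotonic()
        time.sleep(work)  # stand-in for the actual work of this stage
        timings[name] = (time.monotonic() - start) * 1000  # ms
    return timings

timings = handle_request()
print(sum(timings.values()))  # the "expected" total: roughly 10 + 1 + 20 + 1 + 5 = 37ms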
But, since you shouldn’t trust your metrics, you also measured ResponseTime directly, charted/alerted on deviations, and found that the actual ResponseTime was, say, 52ms.
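Here’s a sketch of that cross-check, building on the toy pipeline above: time the whole path once at the outermost edge, compare it against the per-stage sum, and chart/alert when the two disagree. The threshold is made up purely for illustration.

```python
import time

DEVIATION_THRESHOLD_MS = 5  # hypothetical; tune to your own noise floor

def handle_request_checked():
    # Measure the end-to-end path directly instead of deriving it
    # from the per-stage numbers.
    start = time.monotonic()
    timings = handle_request()  # the per-stage version sketched earlier
    response_time = (time.monotonic() - start) * 1000  # ms

    stage_sum = sum(timings.values())
    if abs(response_time - stage_sum) > DEVIATION_THRESHOLD_MS:
        # In a real system this feeds a chart/alert, not a print().
        print(f"deviation! measured={response_time:.1f}ms vs stage sum={stage_sum:.1f}ms")
    return response_time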
That’s quite a difference, no? As to why it was 52ms, let’s look at a bunch of possible issues:
- You’re measuring the validate_user interval instead of validate_users (spelling issues with APIs. yay.); the sketch after this list shows why that slips by unnoticed.
- There’s an Analyze component that you haven’t instrumented, so you’re not measuring the latency there.

And this is just when it comes to measuring time. The point here is that you should be validating your metrics through multiple means, for all of your metrics. In fact, if you are already doing this and all the numbers line up, you should be very, very worried; you’re probably missing something!
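As for why the spelling mismatch goes unnoticed: most metrics backends treat a query for an unknown name as an empty series, not an error. A minimal sketch with a hypothetical in-process registry:

```python
from collections import defaultdict

# Hypothetical in-process metrics registry; hosted backends behave
# similarly in that querying an unknown name is not an error.
metrics = defaultdict(list)

metrics["validate_users"].append(1.2)  # what the code actually emits (plural)

# The dashboard queries the singular name, gets an empty series back,
# and happily renders 0 with no error raised anywhere.
panel_value = sum(metrics["validate_user"])
print(panel_value)  # 0: looks like a blazingly fast stage, is actually a typo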
So yeah, trust your metrics, after you’ve verified them…
(•) You’d be surprised how often I see this missed.
“Did you instrument your code?” — “D-uh, of course!”
“Grafana?” — “Dude, come on, what d’you think I am?”
“When was the last time you looked at it?” — “Uhhhhh”