a short example of why dimensions are suuuuuper valuable

Written by mhat | Published 2017/12/01

We use a bunch of tools to observe and monitor what’s happening in our production systems at Mode Analytics. Two that we use are NewRelic and Honeycomb. Both have their place; what I want to do here is illustrate what Honeycomb brings to the table and why it’s suuuuuper valuable.

This morning I opened NewRelic and saw the following chart:

Cool, now I know that from ~2100–0200 the average time spent in Ruby increased by 2x. That seems not ideal; I wonder what happened there? First, I want to point out that I removed some data from the chart above. Normally it includes time in: queue, gc, middleware, ruby, database, redis, ar, and external. In this case those were all flat, indicating that the inflection above wasn’t in one of those components.
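
To make that concrete, here’s roughly how I think about one request’s timing. The field names and numbers below are made up, not NewRelic’s actual schema, but the logic is the same: if every bucket except ruby stays flat while the total doubles, the growth has to be in our own code.

```python
# Hypothetical per-request timing breakdown, in milliseconds.
# (Illustrative field names only, not NewRelic's actual schema.)
request_timing = {
    "queue": 2.0,
    "gc": 4.0,
    "middleware": 6.0,
    "ruby": 480.0,      # time spent in our own application code
    "database": 35.0,
    "redis": 3.0,
    "ar": 10.0,
    "external": 20.0,
}

total_ms = sum(request_timing.values())
# If every bucket except "ruby" is flat across the window while the total
# doubles, the regression has to be in the app code itself.
print(f"total: {total_ms:.0f}ms, ruby share: {request_timing['ruby'] / total_ms:.0%}")
```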

Great, so I want to investigate what’s going on with, or in, the App. It seems likely the problem is there and not propagating from a dependent service. NR has more nice out-of-the-box charts for this. I wonder what the real impact here was; averages hide a lot of information. Happily, there’s a %-tiles button:

Well, I learned a little here, though not as much as I wanted, since NR only provides Average, Median, 95th, and 99th. I’d like to see the progression from 75 and 85 to 95 and beyond, but oh well. This gives us a little more information, but it’s not that actionable. I know that the 95th percentile was ~500ms and is now ~1000ms, which is consistent with the first chart, along with the shift in magnitude.
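
For what it’s worth, if you have the raw per-request durations (which is exactly what NR’s canned charts don’t hand you), getting that fuller percentile ladder is trivial. This is just a sketch with fake data:

```python
import numpy as np

# Sketch with fake data: pretend these are raw per-request durations (ms)
# for the 2100-0200 window.
durations_ms = np.random.lognormal(mean=6.0, sigma=0.5, size=10_000)

# The progression NR doesn't show out of the box: 75, 85, 95, and beyond.
for p in (50, 75, 85, 95, 99):
    print(f"p{p}: {np.percentile(durations_ms, p):.0f}ms")
```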

A couple of possibilities come to mind here:

  • Something is slower than it was before?
  • A general or specific increase in traffic?

I know through deploy markers that we didn’t deploy right before or right after the event, so we can discount that. It’s more likely that we either had more traffic in general, or more traffic to a specific resource or from a specific customer.

NR makes it somewhat easy to see. I can sort Transactions on various dimensions, but nothing really stood out. After clicking around for a while I found a Controller-Action whose performance timing matched the chart above. An accidental benefit of clicking around was confirming that the majority of Controller-Actions were not affected.
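
If I could query the raw request events directly, the “clicking around” step would look something like this: group by Controller-Action and flag the ones whose latency shifted during the window. The field names are hypothetical; this is a sketch of the idea, not how NR works internally.

```python
from collections import defaultdict
from statistics import median

# Sketch: scan raw request events (hypothetical fields: "timestamp",
# "controller_action", "duration_ms") and flag any Controller-Action whose
# median latency inside the window is well above its baseline outside it.
def shifted_actions(events, window_start, window_end, ratio=1.5):
    inside, outside = defaultdict(list), defaultdict(list)
    for e in events:
        bucket = inside if window_start <= e["timestamp"] < window_end else outside
        bucket[e["controller_action"]].append(e["duration_ms"])

    suspects = []
    for action, durations in inside.items():
        baseline = outside.get(action)
        if baseline and median(durations) > ratio * median(baseline):
            suspects.append(action)
    return suspects
```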

Awesome, now that I’ve found the Controller-Action, I can see both that it’s slower and that the request rate went up significantly during the window and then came back down at the end.

… and this is where my journey with NR ends. I don’t know of a way to dive deeper in NR that’s really effective. I’ve had marginal luck with their Insights product.

Instead, I switched over to Honeycomb. First, I was curious whether I could do anything to simplify finding the Controller-Action. In NR I had to click and hope, and click and hope. In Honeycomb I created the charts below; they’re sort of noisy, but I’m looking for horizontal lines and they’re pretty easy to spot:

Interesting, but this one doesn’t match the timeframe I’m looking for.

… but this one matches perfectly!

It was a lot easier to find these in Honeycomb than it was in NR. Maybe there’s a good way in NR, but IDK. Regardless, we’re now back where we were with NewRelic: we know which Controller-Action, but where can we go from here? This is where Honeycomb really shines: high-dimensional data. One of the attributes we include is an account identifier. Let’s add that to the breakdown:

Oh look, it’s customer-1

… and tada, I know that customer-1 is making more API requests than they typically do. In a different world, maybe they were making the same number of requests but their requests took longer than other customers’, indicating some asymmetry in how our application performs. Or maybe we would have found that it was a specific report, …
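
Mechanically, the idea behind that breakdown is a group-by on two dimensions at once. Here’s a rough sketch in Python (field names hypothetical again); it only works because the events carry high-cardinality fields like the account id.

```python
from collections import Counter, defaultdict
from statistics import quantiles

# Sketch of the "break down by account" step: the same events, grouped on two
# dimensions at once (hypothetical fields "controller_action" and "account_id").
def breakdown(events):
    counts = Counter()
    durations = defaultdict(list)
    for e in events:
        key = (e["controller_action"], e["account_id"])
        counts[key] += 1
        durations[key].append(e["duration_ms"])

    for key, n in counts.most_common(5):
        p95 = quantiles(durations[key], n=20)[18]  # 19th of 20 cut points ~ p95
        print(f"{key}: {n} requests, p95 {p95:.0f}ms")
```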

The key takeaway is that by being able to include high-dimensional data with your events, you can drill into those events. Before Honeycomb, to come to the same conclusion I’d have switched over to SumoLogic to look at App or Nginx logs. I’d probably have been able to figure it out by reducing and raking request paths, knowing how the tokens in those paths map to runs, and then mapping the runs back to accounts. It’s tedious, it’s slow, and it requires being fluent in a vendor-specific query language and then moving intermediary results from one system to another.
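
The flip side of that takeaway is the instrumentation: you have to attach the high-cardinality fields to each event when you emit it. Our app is Rails, so the real code would live in Ruby middleware, but here’s a minimal sketch of the shape of one wide event using Honeycomb’s Python SDK (libhoney); the write key, dataset, and field values are placeholders.

```python
import libhoney

# Minimal sketch of sending one wide, high-dimensional event per request with
# Honeycomb's Python SDK. Write key, dataset, and field values are placeholders;
# our real instrumentation would live in Rails middleware, in Ruby.
libhoney.init(writekey="YOUR_WRITE_KEY", dataset="api-requests")

ev = libhoney.new_event()
ev.add({
    "controller_action": "Api::RunsController#create",  # hypothetical
    "account_id": "customer-1",
    "report_id": "report-123",                           # hypothetical
    "duration_ms": 912,
    "status": 200,
})
ev.send()

libhoney.close()  # flush queued events before the process exits
```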

