As a company dedicated to customer success, it’s important to us that we catch and address issues as fast as we can, ideally before customers notice. Writing perfect code is probably impossible, but catching exceptions in our code before they impact customers is something every developer strives for.
Installing an exception tracker is a great first step towards catching problems earlier. However, there’s a big difference between reporting every error and reporting important exceptions. If you’ve ever taken an intro to psychology course, you may have heard about “Selective Attention” and “Inattentional Blindness”, where psychologists study how the human brain chooses to focus on certain stimuli when several are occurring at once. Here at Olark, we found that without a finely tuned exception reporting and handling process, we had trouble at times picking out the signal from the noise.
We were logging 91 events a minute according to Sentry, our real-time error tracker. It’s very easy for a startup to get caught up in the hustle and bustle of pushing its platform forward and to hand-wave away the long-tail errors that may be piling up in the background. With so many stimuli, we found ourselves focusing our attention elsewhere.
Fortunately, we knew that the majority of those 91 events a minute did not directly impact our customers, but it was still less than ideal. As developers, we knew we could do better and make our error tracker both a useful tool in our debugging process and a decent marker of code quality in general.
We decided to dig deep and see what we could do to redirect our attention back to meaningful exceptions.
Here’s what we found:
Rate limiting is a great thing. When we do experience downtime and the same thing is broken for everyone, there really is no need to get hundreds of notifications all reporting the same exception. But if you’re logging so many irrelevant exceptions that you’re constantly getting rate limited, you run the risk of accidentally filtering out unique exceptions.
We decided our primary measure for the success of this project would be to reduce the percentage of our exceptions that were rate limited. We looked at some of our heavy hitters: exceptions that were being thrown thousands of times a month. Was something actually wrong? Were we throwing each error appropriately, and if so, why were we throwing it so often? Could we use another method, such as an early return, to escape out of the workflow instead?
Ultimately we decided that we were throwing errors appropriately. We had several cases, such as a user typing the wrong email or credit card information, where an error was warranted but an exception report was not. The trouble was not that we were throwing errors; it was that we were reporting them as exceptions. By reducing the volume of exceptions we reported, we made a lot more room for the important ones to rise to the top.
Fortunately, Sentry clients give us a configuration setting to ignore certain exceptions by default. Rather than throwing a generic “Error” everywhere and logging nearly every one, we started subclassing all of our errors and only logging the ones that we thought we’d most like to know about (server errors, for example).
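To make the subclassing idea concrete, here is a minimal sketch in TypeScript. The class names and the shouldReport and reportIfNeeded helpers are hypothetical, invented for illustration rather than taken from our codebase or from Sentry's API. It assumes that expected, user-caused failures share a common base class, so a single check can keep them out of the tracker.

```typescript
// Hypothetical error hierarchy; names are illustrative only.
class AppError extends Error {
  constructor(message: string) {
    super(message);
    this.name = new.target.name;
    // Restore the prototype chain so `instanceof` works when compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Expected failures (bad email, declined card): shown to the user, not reported.
class ValidationError extends AppError {}
class CardDeclinedError extends ValidationError {}

// Unexpected failures we do want to hear about.
class ServerError extends AppError {}

// Decide whether an error deserves a slot in the exception tracker.
function shouldReport(error: unknown): boolean {
  return !(error instanceof ValidationError);
}

// `send` stands in for whatever the tracker client exposes for capturing errors.
function reportIfNeeded(error: unknown, send: (e: unknown) => void): boolean {
  if (!shouldReport(error)) {
    return false;
  }
  send(error);
  return true;
}
```

The code still throws ValidationError wherever it always did. Only the reporting decision changes, which matches the distinction between an error and a reported exception described above.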
We also discovered another potential way to clear out some meaningless exceptions. By default, Sentry installs a global error handler that will catch any error in the browser, including ones that have nothing to do with our website, such as an exception thrown by a browser extension. Sentry allows for the option to ignore exceptions from certain URLs or, alternatively, to whitelist the URLs it should listen to exclusively. We created an npm package called Lumberjane that contains our ignore settings for all of our frontend applications, which has helped reduce noise.
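The URL filtering described above boils down to a small decision rule, sketched here in TypeScript. This is not Lumberjane's actual contents or Sentry's implementation. The FilterSettings shape, the shouldReportFrom function, and the example.com patterns are all assumptions made for illustration, and the real option names differ between Sentry client versions.

```typescript
// Hypothetical shared settings, in the spirit of a common ignore-list package.
interface FilterSettings {
  // If non-empty, only scripts matching one of these patterns are reported.
  allowUrls: RegExp[];
  // Scripts matching any of these patterns are never reported.
  denyUrls: RegExp[];
}

const sharedSettings: FilterSettings = {
  allowUrls: [/^https:\/\/static\.example\.com\//],
  denyUrls: [/^chrome-extension:\/\//, /^moz-extension:\/\//],
};

// Decide whether an exception raised by the script at `scriptUrl` is ours to report.
function shouldReportFrom(scriptUrl: string, settings: FilterSettings): boolean {
  if (settings.denyUrls.some((pattern) => pattern.test(scriptUrl))) {
    return false;
  }
  if (settings.allowUrls.length === 0) {
    return true;
  }
  return settings.allowUrls.some((pattern) => pattern.test(scriptUrl));
}
```

Keeping the patterns in one shared object is the point of the package approach: every frontend application imports the same list, so a newly discovered noisy source only has to be added once.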
After successfully ignoring these expected errors, and focusing our attention on real exceptions, our stats started improving very quickly. Soon we were only being rate limited during downtimes, and when we did look at Sentry we actually found new exceptions we hadn’t seen before that shed light on issues we were already investigating.
Investigating an exception from a few weeks ago can be tricky. We usually deploy code multiple times every day. Unless you’re very familiar with a particular error, if you’re more than a few days out from a deploy, it’ll take some serious digging to figure out which deploy was the source of the bug. This is fortunately an easy fix, as Sentry’s releases feature will tag each exception with a git commit.
We also weren’t taking full advantage of the fact that Sentry allows exceptions to be tagged with additional context, such as a user id or email. Attaching these sorts of pertinent details makes it easier to see how many and what kinds of customers are affected by a given exception and, better yet, lets us follow up directly with customers if we notice anything concerning.
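As a rough illustration of why release and user context matter, the TypeScript sketch below models a reported event as a plain object. The ReportedEvent shape and the buildEvent and countAffectedUsers helpers are hypothetical and are not Sentry's data model. They only show the kind of questions that become answerable once each exception carries a commit and a user.

```typescript
interface UserContext {
  id: string;
  email?: string;
}

// Hypothetical event shape; a real tracker stores far more than this.
interface ReportedEvent {
  message: string;
  release: string; // e.g. the git commit that was deployed
  user?: UserContext;
  tags: Record<string, string>;
}

// Attach the release and any known user context to an error before sending it.
function buildEvent(
  error: Error,
  release: string,
  user?: UserContext,
  tags: Record<string, string> = {}
): ReportedEvent {
  return { message: error.message, release, user, tags: { ...tags } };
}

// Count distinct users affected by a given release; anonymous events are skipped.
function countAffectedUsers(events: ReportedEvent[], release: string): number {
  const seen: Record<string, true> = {};
  for (const event of events) {
    if (event.release === release && event.user) {
      seen[event.user.id] = true;
    }
  }
  return Object.keys(seen).length;
}
```

With the release attached, "which deploy introduced this?" becomes a lookup instead of an archaeology project, and with the user attached, "who do we need to follow up with?" is a simple filter.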
Even when we knew an issue was occurring, our frontend developers often found themselves staring blankly at minified code. Minified code is great for loading apps quickly, but nearly impossible for human eyes to read.
Fortunately, Sentry allows us to upload sourcemaps to their servers directly. Through that process we can see the relevant files as part of the stack trace, exactly as they were when we wrote the code: unminified, and before they were transpiled from ES6 or CoffeeScript.
Last but not least was the issue of process. While high severity exceptions are easy to justify taking time from projects to work on, as a team we didn’t always know what to make of the low severity bugs. While Olark as a company puts a high level of trust in engineers to work on what they think is important, as an engineer it’s not always easy to know how much effort to put into low severity bugs. I find that with hard decisions, my default tends to be to postpone. Small bugs that we could knock out in a few hours would just sit there in the queue, with the queue becoming more and more daunting to look at.
We’re still testing out what process works best for us as a team and for the company as a whole. One idea we’re trying out is to assign each developer one bug a week to investigate in addition to their regular work. It’s the amount that we might do on our own anyway, but in this way it’s a formalized process and serves as both permission and encouragement to focus on the small potatoes.
By identifying these issues and researching best practices for dealing with them, we’ve gone from 91 events a minute down to 1–3 events a minute, from nearly always being rate-limited by Sentry, to only getting rate-limited during actual downtimes. We can definitely still make improvements, and we’re still learning, but we’ve come a long way in creating an infrastructure and process which will support engineers in building high quality code, which hopefully translates into an easier, smoother experience for our customers.
About the Author: Sarah Zinger is a Callback Handler at Olark. She lives in New York and enjoys SciFi, long hikes through the city, and volunteering to make the New York City tech scene more awesome.