This blog is part of a series that delivers insights from the monthly meetups of Bugsnag’s Apps + Coffee Connection (ACC) community.
In February, we kicked off our inaugural ACC meeting with an online roundtable discussion moderated by James Smith, Bugsnag co-founder and SmartBear SVP of Products. The conversation explored the different ways companies can empower engineering teams with metrics for everyday work and decision-making.
Ever wondered what key metrics other engineering teams are tracking?
The short answer: it varies completely. Even within the same organization, different teams look at different data and metrics, depending on responsibilities, needs, and deliverables. For example, customer-facing teams often have strict goals around bugs and exceptions, while data ingestion and processing teams care more about metrics around data integrity.
However, it’s safe to say that almost everyone views data and metrics as being closely tied to decision-making. As one participant stated, “Metrics definitely help you guide planning for what’s coming next,” regardless of what your development cycle looks like.
Not only does specific metric usage vary, but teams also differ in how rigorous their metrics need to be. Some teams focus less on “perfect metrics” and more on Agile-style retrospective metrics that rate sprint success with traffic lights (green, yellow, red).
This approach enables conversations during retrospectives around the development process and how to work better as a team. Often referred to as “informative,” these types of metrics help encourage discussion and can lead to the adoption of new practices.
In contrast to informative metrics, “interruptive” metrics encompass things like customer complaints (escalated help tickets), critical security issues, and even the ping of new bugs.
These days, interruptive metrics are everywhere. Engineers receive constant notifications, emails, and Slack messages, and that leads to a daily struggle between addressing those interruptive metrics by fixing bugs, versus working on the roadmap and the tasks at hand.
“What we hear from the engineering leaders at the Netflixes, Pinterests, and Tinders of the world is that anything that interrupts work should be really, really meaningful. It should matter,” explained James. “You should avoid interrupting work because, as a software engineer, it’s really easy to get distracted and then really hard to pick back up again where you were.”
Naturally, the question arises: when should you use interruptive vs. informative metrics?
One strategy is to establish two stability metrics as team goals. The first is akin to an SLA (service level agreement), or a commitment metric, and is interruptive: if stability drops below this score, it’s time to interrupt engineers, stop roadmap work, and address the bugs. The second is akin to an SLO (service level objective), which is aspirational: if stability is above this goal, teams can decide whether they want to be more aggressive.
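The two-threshold approach can be sketched as a simple triage rule. The threshold values below, and the notion of “stability” as a percentage of crash-free sessions, are illustrative assumptions rather than specifics from the discussion:

```python
# Illustrative sketch of the two-threshold stability model: a commitment
# (SLA-like) floor that interrupts work, and an aspirational (SLO-like)
# target above it. Thresholds and the crash-free-sessions definition of
# "stability" are assumptions for the example.

SLA_THRESHOLD = 99.0   # commitment: below this, interrupt and fix bugs
SLO_THRESHOLD = 99.5   # aspiration: above this, room to take more risk

def triage(stability: float) -> str:
    """Map a stability score (% of crash-free sessions) to a team action."""
    if stability < SLA_THRESHOLD:
        return "interrupt"      # stop roadmap work, address the bugs now
    if stability < SLO_THRESHOLD:
        return "monitor"        # between the two goals: watch, keep shipping
    return "aspirational"       # healthy; the team may choose to move faster

print(triage(98.7))   # → interrupt
print(triage(99.2))   # → monitor
print(triage(99.8))   # → aspirational
```

The point of the middle band is that nothing pages anyone: only a breach of the commitment threshold is allowed to interrupt engineers.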
Before we leave the topic of interruptive metrics, it’s worth noting that engineering managers and leaders have never had more visibility, but all that visibility generates its own problems. Slack channels, for example, eat up a lot of time. In many engineering organizations, managers don’t even know about half the Slack channels that exist, any of which could be dragging down the roadmap and slowing development.
“I loved Slack when we first started, especially the fact that anyone can create a group and a bot and integrate a resource into it,” stated a participant. “But now, at least half the time, I have no idea until there’s an escalation that something’s been going on. It takes so much time to keep track of the spikes various groups are seeing. How do you cut through the noise?”
The general consensus was that there needs to be a focus on standardized alerting. As one participant shared, “I find that standardizing on the approach helps holistically how you can approach the day-to-day triage management.”
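One way to standardize alerting, tying back to the interruptive-versus-informative distinction, is to make every alert declare a severity and route only the highest severities to channels that interrupt people. The channel names and severity levels here are hypothetical, chosen just to illustrate the idea:

```python
# Hypothetical sketch of standardized alert routing: each alert carries a
# severity, and a single routing table decides whether it interrupts anyone.
# Severity levels and channel names are illustrative assumptions.

from dataclasses import dataclass

ROUTES = {
    "critical": "#oncall-page",    # interruptive: reaches the on-call engineer
    "warning":  "#team-triage",    # reviewed at daily triage, not a page
    "info":     "#metrics-digest", # rolled up into a periodic summary
}

@dataclass
class Alert:
    source: str
    severity: str   # "critical" | "warning" | "info"
    message: str

def route(alert: Alert) -> str:
    """Return the one channel an alert goes to; unknown severities stay quiet."""
    return ROUTES.get(alert.severity, "#metrics-digest")

print(route(Alert("payments", "critical", "error spike")))  # → #oncall-page
print(route(Alert("search", "info", "latency drift")))      # → #metrics-digest
```

With a convention like this, the question “how do you cut through the noise?” has a structural answer: new bots and integrations plug into the routing table instead of each inventing their own channel.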
Another method is to think about monitoring from the customer experience perspective and then bubble up issues accordingly. Ask yourself, what is the customer seeing? These days, it’s not as much about classic system monitoring; it’s about product and customer monitoring.
When the subject of dashboarding tools came up, answers varied a great deal and demonstrated a breadth of pain and challenges around dashboard views.
First and foremost was the recognition that multiple tools are still needed. As a participant stated, “I don’t believe there’s one tool — and I don’t believe there’s ever going to be one tool — that will solve for one BI response. I think you always have to have a mix, and I think that’s to the benefit of the team.”
Another point made was that there’s a huge problem with too many charts on a single dashboard. As James shared, “Sometimes you end up with the mentality, ‘We can dashboard this, so we should dashboard this.’ I see it happen with engineering, sales, and marketing where there’s simply too many charts on a dashboard. What I want is one chart that helps me understand whether things are good or not right now as a key stakeholder, and then let me drill deeper if I want to do so.”
The same applies to KPIs and metrics. Teams are advised to set only a few KPIs so that everyone can work towards what’s most important, and to use standardization to cull metrics, narrowing what executives and teams track to a meaningful number. Again, the best advice offered was to take the customer’s point of view and focus on the two most important questions: Are services up, and are they stable?