How to define and spend your tech debt budget

If you’ve ever sat through sprint planning and argued for time to pay back some technical debt (i.e. to define a tech debt budget), this article explains how to go about it.

It’s hard enough to make room for competing features in a sprint, but trying to justify sacrificing some of these features to spend 10–20% of engineering time on paying back technical debt before it’s too late is a crusade of epic proportions.

I’ve attended numerous meetings between product and engineering where too many hours are wasted on this very topic. Emotions run high, the decision is informed by anecdotal evidence, and the opinion of the loudest person in the room prevails.

The sum total of these poor decisions often leads to dramatic consequences. The (slightly caricatured) dilemma is: If business pressures take over, the company risks taking on too much technical debt, engineers end up demotivated, the company goes technically bankrupt, and their competitors win. If engineering pressures take over, the company risks taking on too little technical debt, competitors ship products and features faster, get the lion’s share of the market, and use that cash to pay back their technical debt later. Again, you lose.

Conventional wisdom says that the solution to this problem is for software engineering teams to build an intuitive sense of the codebase, of where technical debt lies and of the effects it’ll have on the company, and to build trust within the organisation. If your founding Chief Architect repeatedly tells you that you need to refactor core code right now, you (usually) just do it.

This is all well and good, and we should all strive to retain our engineers and create a culture of knowledge sharing and trust. But it takes years of hard work, and we still come out the other end of a refactoring effort having virtually no idea whether our time was well spent. Could we have waited a little longer before paying back this technical debt and shipped a few more features instead? Did we just save days of future engineering time? We’ll never know for sure.

This is often chalked up to product development being more art than science. It’s about time we injected some science into it.

Technical debt taken on deliberately and managed properly is an amazing tool! Much like financial debt, you can use it for extra leverage. On the other hand, if you unknowingly take on too much, or take it on without really understanding the terms of the deal (i.e. the impact on your codebase, customers, team, and business), it can lead to any software company’s demise.

As it turns out, this is exactly how the best Site Reliability Engineering teams think about their site reliability budget (usually called an error budget), a concept popularised by Google. Site Reliability Engineering is responsible for keeping software products up and running. Contrary to what a lot of people might think, companies like Google don’t aim for 100% uptime. That’s because 99.99% uptime, which is enough for Google products to appear supremely reliable to real-world users, is far easier and cheaper to reach. That last 0.01% simply isn’t worth fighting for.

A 99.99% uptime target allows roughly 52 minutes of downtime per year, so Google will want to get as close to that number as possible. Any less than 52 minutes of downtime is a missed opportunity to take extra risks and deliver more ambitious features for their customers faster.
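The arithmetic behind that figure is easy to check. Here is a minimal sketch; the function names are mine, and the calculation assumes a 365-day year and a target expressed as a fraction between 0 and 1.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_target: float) -> float:
    """Minutes of downtime per year allowed by an availability target (0..1)."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return (1.0 - availability_target) * MINUTES_PER_YEAR


def remaining_budget_minutes(availability_target: float, downtime_so_far: float) -> float:
    """Budget left this year; a negative result means the budget is overspent."""
    return downtime_budget_minutes(availability_target) - downtime_so_far


print(f"{downtime_budget_minutes(0.9999):.1f} minutes per year")  # 52.6 minutes per year
```

A large positive remainder late in the year is the "missed opportunity" described above: there was room to take more risk.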

Think of your technical debt budget like your site reliability budget. Provided it’s prudent technical debt you’re taking on, and that you remain below the maximum amount of tech debt you can take on before your customers and business start getting affected, you should be taking on more technical debt to take more risks and beat your competitors.

When your technical debt budget is in the red, pay some of that debt back. If it’s in the green, you can afford to take more risks and take on more debt. Your goal is to constantly stay as close to your ideal amount of tech debt as you can. In other words, if you’re at the peak of the red portion of the graph, the ideal tech debt budget is A ⇒ B. If you’re at the peak of the green portion, it’s B ⇒ C. A ⇒ C is too big a budget.
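To make the rule concrete, here is a rough sketch that assumes your tech debt can be summarised as a single score and that you have agreed on an ideal level for it. The score, its units, and the names `DebtBudget` and `debt_budget` are all hypothetical; the point is only that the budget is the distance back to the ideal, never further.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtBudget:
    action: str    # "pay back", "take on", or "hold"
    amount: float  # size of the budget, in the same (hypothetical) units as the score


def debt_budget(current_debt: float, ideal_debt: float, tolerance: float = 0.0) -> DebtBudget:
    """Budget that brings the current debt score back to the ideal level."""
    gap = current_debt - ideal_debt
    if gap > tolerance:
        # In the red: pay back exactly the excess, no more.
        return DebtBudget("pay back", gap)
    if gap < -tolerance:
        # In the green: there is room to take on this much more debt.
        return DebtBudget("take on", -gap)
    return DebtBudget("hold", 0.0)


budget = debt_budget(current_debt=130, ideal_debt=100)
print(budget.action, budget.amount)  # pay back 30
```

The `tolerance` parameter is an assumption of mine: it avoids flip-flopping between paying back and taking on debt when the score hovers around the ideal.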

Because technical debt can now be measured (a subject we wrote about in another article), this isn’t just conceptual anymore; it’s fully practical.

How to get max bang for your tech debt budget buck

The appropriate tech debt budget is one that brings you back down, or up, to the maximum amount of technical debt you’ll tolerate. In order to define that budget, you’ll want to identify the areas of your codebase where tech debt is worth paying back immediately. Debt worth paying back immediately is debt that will get in the way of your company reaching its current set of objectives. You obviously don’t want to pay back too little debt, but you also want to avoid paying back too much.

“Not everything needs refactoring. If it’s not critical, or nobody needs to improve its functionality in the next months, or it’s just too complicated, consider acknowledging it as tech debt.”

Andreas Klinger, Head of Remote at AngelList, in “Refactoring large Legacy Codebases”

Long story short, your goal is to identify the intersection of things you’ll work on this sprint, month, or quarter, and parts of your codebase that have tech debt. Then, pay off the debt in that intersection but not outside of it.
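In code, that rule is a plain set intersection. The sketch below assumes you already have two lists of files: the ones your planned work will touch and the ones you consider indebted. The file names and the `split_debt` helper are made up for illustration.

```python
def split_debt(planned_files: set[str], indebted_files: set[str]) -> tuple[set[str], set[str]]:
    """Split indebted files into debt to pay now and debt to acknowledge for later."""
    pay_now = planned_files & indebted_files      # inside the intersection
    acknowledge = indebted_files - planned_files  # indebted, but nobody will touch it soon
    return pay_now, acknowledge


planned = {"checkout/cart.py", "checkout/payment.py", "search/index.py"}
indebted = {"checkout/payment.py", "legacy/reports.py", "search/index.py"}

pay_now, acknowledge = split_debt(planned, indebted)
print(sorted(pay_now))      # ['checkout/payment.py', 'search/index.py']
print(sorted(acknowledge))  # ['legacy/reports.py']
```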

And that’s where the science complements the art. You can use data to identify areas where you need to pay back tech debt soon:

1. Identify files in your codebase with weak ownership. These are prone to problems as code ownership is a leading indicator of your codebase’s health. More on this in my article Why you want an engineering culture of ownership.
2. Measure cohesion and coupling for these files. You’ve now pruned your list to a set of files with weak ownership, low cohesion, and high coupling. More on what each of these metrics are in our article on 3 metrics to understand and manage technical debt.
3. Calculate churn for each of these files to identify the subset of problem files. As Microsoft Research showed, while active files make up only 2–8% of a total system, they are responsible for 60–90% of all defects.
4. Compare these files with your roadmap for the quarter. Will any of the features listed on your roadmap require engineers to work on the subset of problem files you’ve identified? If so, target these files for refactoring, estimate the work required, assign it to the engineers who should be owners of these files, and bake this work into your plans.
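The four steps above can be sketched as a small filter over your commit history. Everything here is an assumption made for illustration: ownership is approximated by the top contributor’s share of commits to a file, churn by the number of commits touching it, cohesion and coupling arrive as precomputed scores between 0 and 1, and the thresholds are arbitrary. Swap in whatever definitions and tooling your team actually uses.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    author: str
    files: tuple[str, ...]


def ownership_share(commits: list[Commit]) -> dict[str, float]:
    """Share of each file's commits made by its top contributor (0..1)."""
    per_file: dict[str, Counter] = defaultdict(Counter)
    for commit in commits:
        for path in commit.files:
            per_file[path][commit.author] += 1
    return {path: max(a.values()) / sum(a.values()) for path, a in per_file.items()}


def churn(commits: list[Commit]) -> dict[str, int]:
    """Number of commits that touched each file."""
    counts: Counter = Counter()
    for commit in commits:
        counts.update(commit.files)
    return dict(counts)


def refactoring_targets(commits, structure, roadmap_files, *, max_ownership=0.5,
                        min_cohesion=0.5, max_coupling=0.5, min_churn=3):
    """Files that pass all four filters; `structure` maps path -> (cohesion, coupling)."""
    owner, changes = ownership_share(commits), churn(commits)
    targets = []
    for path in sorted(owner):
        # Files with no structural scores are treated as healthy (assumption).
        cohesion, coupling = structure.get(path, (1.0, 0.0))
        if (owner[path] <= max_ownership                                 # 1. weak ownership
                and cohesion < min_cohesion and coupling > max_coupling  # 2. poor structure
                and changes[path] >= min_churn                           # 3. high churn
                and path in roadmap_files):                              # 4. on the roadmap
            targets.append(path)
    return targets


commits = [
    Commit("ana", ("billing.py", "utils.py")),
    Commit("ben", ("billing.py",)),
    Commit("cat", ("billing.py",)),
    Commit("dan", ("billing.py",)),
    Commit("ana", ("utils.py",)),
    Commit("ana", ("utils.py",)),
]
structure = {"billing.py": (0.3, 0.8), "utils.py": (0.9, 0.1)}

print(refactoring_targets(commits, structure, roadmap_files={"billing.py"}))  # ['billing.py']
```

In this toy history, billing.py has four different authors, poor structural scores, and a place on the roadmap, so it is the only target. utils.py changes just as often as the threshold requires but has a clear owner and healthy scores, so it is left alone.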

Et voilà.

Get in a long-term relationship with tech debt

Having implemented this data-driven approach at Stepsize and with multiple top-notch software companies, not only have we found the topic of technical debt a lot easier to broach, but we now know how much debt we’re willing to take on and when and how to pay it back, and we rarely wonder whether we made the right trade-off between new features and tech debt. We’ve removed a big chunk of guesswork, and a lot of fear and anxiety went with it.

To be clear, this is no silver bullet for software engineering teams to use once a year and call it a day. You need to get up close and personal with your tech debt, track progress on all these metrics each sprint, and keep improving at this whole process to reach technical wealth.

Also, check out Stepsize. We built a SaaS product to do all that, and you can try it for free :).