The Most Expensive Way to Learn About Reliability

Written by patrickjlonda | Published 2026/03/25
Tech Story Tags: fault-tolerance | chaos-engineering | system-reliability | site-reliability-engineering | graceful-degradation | outage-prevention | production-outages | scalability-risks

TL;DR: Outages are costly and rarely have one owner. Learn how chaos engineering helps teams build resilient systems before failure hits production.

When a software system or service fails, who owns that failure?

  • Is it the developer who introduced a certain misconfiguration or weakness?
  • Is it the 3rd party dependency that went down or had delays at the wrong time?
  • Is it the surge of unexpected users when a product in your store goes viral?
  • Is it the leadership's fault for pushing for development velocity at all costs?

Failure almost never has just one cause or one owner. That’s why many organizations have adopted a blameless culture, steering postmortems away from the kind of root-cause interrogation that can erode the psychological safety of a development team.

Does that mean that no one owns a system failure? Tell that to all the on-call engineers desperately trying to troubleshoot and restore their systems.

Failure is Inevitable and Everyone Owns It

Failures are something that an organization collectively owns. By committing to host an application successfully, an organization also signs up for the possibility of failing to meet that goal at any given time.

While the ownership is collective, the pressure of failures is still disproportionately distributed to SRE and operations teams who are keeping watch, often forced into a reactive mode and providing answers to other stakeholders.

When outages occur in production, reliability engineers have to get services operational again, investigate what happened, and put forward patchwork fixes to prevent that specific incident from recurring. By the time they have completed that process, another incident or issue may already be queued up.

How can organizations start to design more fault-tolerant systems when their reliability engineers are stuck in a proverbial burning room, reacting to one incident after another?

The Most Expensive Way to Improve System Reliability

When outages occur in production, support tickets rush in and business grinds to a halt. Customers lose trust in the service and may start considering alternatives.

Many studies have surfaced figures for the business cost of 1 minute of downtime. This 2024 PagerDuty study estimated an average of $5,000 per minute of downtime. This study commissioned by BigPanda estimates that number to be closer to $14,000 for companies with over 1K employees.

With every outage, organizations are effectively spending thousands of dollars on each incident only to identify a single way that their systems can break.

Compare that to reliability testing or chaos engineering, where teams use experiments to proactively inject faults into their systems to monitor the impact in a controlled way.

Instead of waiting for events to occur in production to learn about their systems, teams can use chaos engineering tools like Steadybit to run countless experiments on their systems and map out their risks across services. Once teams have identified their biggest reliability risks, they can start to design safeguards to mitigate them.

Identifying Reliability Risks with Chaos Engineering

There are three main categories of reliability risks for your system: Redundancy, Scalability, and Dependencies.

Redundancy Risks

You can validate your system’s redundancy by exercising failover processes, confirming that other resources are available to pick up the slack. For example, you could run an experiment that simulates an AWS Availability Zone outage, then observe how your application’s performance degrades and whether instances in other zones take over.
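As a toy illustration of that experiment, here is a minimal Python sketch. The zone names, instance counts, and routing logic are all invented for illustration; a real chaos tool would take down actual infrastructure rather than a dictionary entry. The sketch “fails” one zone and checks that traffic still lands on the survivors:

```python
import random

# Hypothetical zones and instance counts (illustrative only).
ZONES = {"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 2}

def healthy_zones(zones, failed):
    """Return the zones still able to serve traffic."""
    return {z: n for z, n in zones.items() if z not in failed}

def route_request(zones, failed):
    """Pick a zone for a request, skipping failed zones."""
    candidates = healthy_zones(zones, failed)
    if not candidates:
        raise RuntimeError("no healthy zones: total outage")
    return random.choice(sorted(candidates))

# Simulated AZ-outage experiment: fail one zone and confirm the
# remaining zones absorb every request.
failed = {"us-east-1a"}
served = [route_request(ZONES, failed) for _ in range(100)]
print(f"served 100 requests from zones: {sorted(set(served))}")
```

If the assertion that no request hit the failed zone ever broke, that would be exactly the kind of redundancy gap the experiment exists to surface.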

Scalability Risks

If your organization runs any kind of promotion or major event, you can see load spikes and highly dynamic demand. Even if you have auto-scaling configured, it’s important to verify that it actually behaves the way you intended. You could run load tests to check how performance changes as traffic increases. But what if a 3rd-party service adds latency at the same time?

Chaos experiments enable you to combine load tests with other real-world factors and conditions to create the most useful reliability tests for your organization.
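One way to picture such a combined experiment is the simulated sketch below. The latency numbers and distribution are made up, not benchmarks, and a real chaos tool would inject the delay into live traffic rather than a simulation; the point is only to show how injected dependency latency shifts a tail-latency metric like p95:

```python
import random

def simulate_request(base_ms=50, injected_dependency_ms=0):
    """Simulated service latency: jittered base time plus any
    injected 3rd-party delay (all numbers illustrative)."""
    return random.gauss(base_ms, 10) + injected_dependency_ms

def p95(samples):
    """95th-percentile latency of a list of samples."""
    ordered = sorted(samples)
    return ordered[int(len(ordered) * 0.95) - 1]

random.seed(42)
# Baseline load test: 1,000 requests, no injected faults.
baseline = [simulate_request() for _ in range(1000)]
# Chaos variant: same load plus 200 ms of injected dependency latency.
with_fault = [simulate_request(injected_dependency_ms=200) for _ in range(1000)]

print(f"p95 baseline:   {p95(baseline):.0f} ms")
print(f"p95 with fault: {p95(with_fault):.0f} ms")
```

Comparing the two p95 values tells you whether your service absorbs the dependency’s slowdown or passes it straight through to users.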

Dependency Risks

Modern software systems are more interconnected than ever, and AI agents take that to another level. If your application relies on responses from agentic or 3rd-party services, one failed call risks triggering a domino effect. Mapping out your dependencies and testing failure scenarios is essential to creating the best possible experience for your users.
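A common pattern for containing that domino effect is a circuit breaker that stops calling a failing dependency and falls back to a safe default. Here is a minimal sketch; the service names, threshold, and fallback data are all hypothetical:

```python
class CircuitBreaker:
    """Minimal circuit-breaker sketch: after `threshold` consecutive
    failures, stop calling the dependency and use the fallback."""

    def __init__(self, call, fallback, threshold=3):
        self.call = call
        self.fallback = fallback
        self.threshold = threshold
        self.failures = 0

    def request(self, *args):
        if self.failures >= self.threshold:
            return self.fallback(*args)  # circuit open: skip the dependency
        try:
            result = self.call(*args)
            self.failures = 0  # a success resets the failure count
            return result
        except Exception:
            self.failures += 1
            return self.fallback(*args)

def flaky_recommendations(user_id):
    """Stand-in for a 3rd-party call that is currently down."""
    raise TimeoutError("3rd-party service unavailable")

def cached_recommendations(user_id):
    """Safe static fallback served while the dependency is failing."""
    return ["bestseller-1", "bestseller-2"]

breaker = CircuitBreaker(flaky_recommendations, cached_recommendations)
for _ in range(5):
    items = breaker.request("user-42")
print(items)
```

A chaos experiment that blocks the 3rd-party endpoint is precisely how you would verify the breaker opens and the fallback actually ships to users.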

Designing for Graceful Degradation

Graceful degradation is a system design strategy that aims to sustain core functionalities for users even when related services fail.

If a banking application is experiencing system failures that have caused an outage for the depositing functionality, other aspects of the app should still be usable. For example, a user might still be able to view their account balances and just see a notice indicating that some functionality may be limited at the moment.
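That banking scenario might be sketched as a handler that always serves balances and degrades only the deposit feature. The field names, amounts, and notice text below are invented for illustration:

```python
def get_balances(user):
    """Stubbed balance lookup (illustrative data)."""
    return {"checking": 1250.00, "savings": 4800.00}

def account_page(user, deposits_available):
    """Build the account page, degrading gracefully when the
    deposit service is down instead of failing the whole page."""
    page = {"balances": get_balances(user)}
    if deposits_available:
        page["deposit_form"] = True
    else:
        page["deposit_form"] = False
        page["notice"] = "Deposits are temporarily unavailable."
    return page

degraded = account_page("user-42", deposits_available=False)
print(degraded["notice"])
```

The key design choice is that the balance view never depends on the deposit service, so a failure in one feature cannot take the whole page down with it.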

This type of thoughtful design maintains and builds trust with a user instead of shocking them with performance that can’t match their expectations.

Shifting Reliability Efforts to Be More Proactive

If you want to get started with chaos engineering and implement best practices for resilient systems, it’s best to work with experts who have been there before.

Our team at Steadybit collaborates with companies around the globe to help them identify reliability risks and design more resilient systems.
You can get started with a 30-day free trial of the Steadybit platform or book a call with our team today.


Written by patrickjlonda | I lead marketing at Steadybit, a chaos engineering and reliability testing platform.
Published by HackerNoon on 2026/03/25