Fire in production!

Written by the-code-gang | Published 2018/10/03
Tech Story Tags: devops | software-development | software | software-engineering | programming


I guess that most of us have some horror stories to share about running our applications in production. Things that did not work as expected. Things that quickly got out of hand. Or even things that stopped working for no apparent reason. A great book with a handful of such horror stories is Release It!: Design and Deploy Production-Ready Software (Pragmatic Programmers) by Michael T. Nygard.

A reasonable question to ask ourselves is: why, after all the effort we put into designing and developing the software, is the quality of the production system still poor? At the end of the day, the output of all the development effort is the system that we build, and if that system does not meet the users’ expectations, then this effort doesn’t matter (much).

Almost every system/application has bugs. The topic of this post is the mentality that engineers should have when their system goes live. There will always be unexpected situations in production, but the bottom line is how to limit them, how to avoid introducing new ones, and, most importantly, how to learn from mistakes.

The reason

No matter how clean our code is or what development process we use, the usual reason for problems in production is a gap between development and production.

The production environment is different from the development environment in terms of both infrastructure and workload. Usually, the gap between the production and development environments grows along with the complexity of our system. Staging and testing environments can (partially) fill this gap, but only if they reflect the production environment, which is usually not the case. The (sad) truth is that the cost of building and maintaining an environment similar to production is high, and it is very often the first thing to be cut.

In microservices, starting development without CI/CD is considered a showstopper. Microservices are inherently complex, and experience has already proved that you are playing against the odds if you don’t have CI/CD from the very beginning.

Unfortunately, we tend to fill this gap with assumptions during the design and development phases. These assumptions require some creativity and imagination and, naturally, reduce our confidence. We hope to get them right. So, if we don’t have a way to validate these assumptions, then, inevitably, they will be proved wrong at the worst possible moment (believe me, Murphy’s Law is a thing).

The whole lifecycle of the software, from analysis to deployment, should rely on as few assumptions as possible.

Treat production with extra care

When things get out of hand, we need to act immediately. Our primary goal is to resolve the problem and bring the system back to a healthy state. Also, we should be able to gather data for some post-mortem analysis.

In the process of getting the system back to a healthy state, we might be tempted to apply some manual hacks. These hacks might include manually changing configuration files, restarting instances of the running services, or even changing the code/artifacts. Although these hacks can save us by bringing the system up again, they usually come at a price. We need to keep track of whatever we did; otherwise, we will end up running an unknown configuration in production.

By all means, we should avoid this horrible situation. Having a system with an unknown configuration is worse than having no system at all, since no one can tell how it behaves. It is like gambling, and it compromises every good effort made during the previous phases of the software lifecycle.

Remember: the quality of a process is the minimum quality of all its subprocesses. If we don’t pay attention to one part of the lifecycle, we will end up with a poor lifecycle overall.

How to deal with production issues

The best way to deal with production issues is to do some analysis beforehand and establish processes that will be followed when an issue occurs. Use a bug tracking system to log the issues that have happened. Have a well-defined process to change the data if needed. Take a snapshot of the system while it is in the problematic state to examine it later.
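For instance, the “snapshot” can be as simple as a small script that gathers logs, the currently deployed configuration, and some process/resource information into a single archive for later analysis. The sketch below assumes a Unix-like host and uses placeholder paths; adapt it to your own stack.

```python
# Hypothetical sketch: collect a diagnostic snapshot while the system is unhealthy.
# The file paths are placeholders; point them at your real logs and configuration.
import shutil
import subprocess
import tarfile
import time
from pathlib import Path

def collect_snapshot(output_dir: str = "/tmp/snapshots") -> Path:
    stamp = time.strftime("%Y%m%d-%H%M%S")
    workdir = Path(output_dir) / f"incident-{stamp}"
    workdir.mkdir(parents=True, exist_ok=True)

    # Copy application logs and the currently deployed configuration (placeholder paths).
    for src in ("/var/log/myapp/app.log", "/etc/myapp/config.yaml"):
        if Path(src).exists():
            shutil.copy(src, workdir / Path(src).name)

    # Capture process and resource state at the moment of the incident.
    (workdir / "ps.txt").write_bytes(subprocess.run(["ps", "aux"], capture_output=True).stdout)
    (workdir / "df.txt").write_bytes(subprocess.run(["df", "-h"], capture_output=True).stdout)

    # Bundle everything into a single archive for the postmortem.
    archive = Path(f"{workdir}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(workdir, arcname=workdir.name)
    return archive

if __name__ == "__main__":
    print(f"Snapshot written to {collect_snapshot()}")
```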

But most importantly, have a process for everything that you think might come up. No matter how trivial this may sound, you would be surprised by how much pain the lack of a process can cause. We need to treat issues and bugs as first-class citizens instead of rare incidents. Things will go wrong, and we have to live with that!

There are two categories of problems that may occur in a production system: business-related issues and operational issues.

Business-related issues

Business-related issues include whatever is preventing our users from getting their job done. They usually occur due to a bug or missing feature in our system, but that’s not always the case as we will see in the following paragraph.

We should design our software to support changes so that we don’t have to edit the data directly in the database. There will always be a need to change data in our system, and if we do it directly in the database, we might leave the system in an inconsistent state. Let’s say, for example, that a user complains that they cannot add items to their cart because something in the frontend is broken, and we have to add the item for them.

We need special endpoints/services/tools to do that for us instead of adding it manually. There are two reasons for this. First, adding an item to the cart might mean more than inserting a record in the database; our application may also send messages to some external system for analytics, etc. Second, these special services will usually be used by the support engineers (SREs), who might not know the internals of the system, and even if they do, they might not be up to date with recent changes. The more complex an operation in our system is, the more error-prone it is when done manually.
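To make this concrete, here is a minimal sketch of such a support endpoint, written with Flask. The in-memory “storage” and the analytics stub are placeholders for whatever your system actually uses; the point is that the support action goes through application code and triggers the same side effects as the normal flow.

```python
# Hypothetical support endpoint: add an item to a user's cart through application
# code instead of a manual database edit. Storage and analytics are stand-ins.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder in-memory storage; a real system would use its own repository.
CARTS: dict[str, list[str]] = {}

def publish_analytics_event(name: str, payload: dict) -> None:
    # Stand-in for the analytics message the regular frontend flow would send.
    app.logger.info("analytics event %s: %s", name, payload)

@app.route("/support/carts/<user_id>/items", methods=["POST"])
def add_item_for_user(user_id: str):
    item_id = request.get_json()["item_id"]

    # 1. Persist the change through application code, not a raw SQL update,
    #    so that validation and invariants still apply.
    cart = CARTS.setdefault(user_id, [])
    cart.append(item_id)

    # 2. Adding an item is more than a database record: emit the same analytics
    #    event the regular flow would have produced.
    publish_analytics_event("item_added", {"user_id": user_id, "item_id": item_id})

    return jsonify({"status": "ok", "cart_size": len(cart)}), 201
```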

Operational issues

Operational issues include failures in the machines we have deployed our application on, network problems, etc. The more complex our infrastructure is, the harder it is to deal with issues manually. Imagine we have tens or hundreds of nodes in our infrastructure and some of them start failing; it is nearly impossible to deal with all of them at once and resolve every problem by hand. Even a simple upgrade of the application might take weeks and is prone to errors that can be very serious.

We need to use tools to automate these processes. From a simple database migration to a massive redeployment of all our nodes, we need to eliminate human interference as much as we can. Thankfully, there are plenty of tools and techniques out there that can help us cope with such situations. We should use CI/CD tools and practices to automate the deployment of our application, delegate the pain of handling the deployment and management of our infrastructure to tools like Kubernetes, reduce deployment downtime by using techniques like blue/green deployment, and use tools like ELK or New Relic to keep track of everything that is happening in our system.
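As an illustration, blue/green deployment boils down to a short, strictly ordered sequence of steps. The sketch below uses hypothetical deploy/health-check/routing functions as stand-ins for the real tooling (Kubernetes, a load balancer API, etc.); what matters is the order of operations and the fact that the old environment stays untouched until the new one is proven healthy.

```python
# A minimal sketch of a blue/green release. All functions are stand-ins for real tooling.
import time

def deploy(environment: str, version: str) -> None:
    print(f"deploying {version} to {environment}")      # stand-in for the real rollout

def is_healthy(environment: str) -> bool:
    return True                                          # stand-in for real health checks

def switch_traffic(to_environment: str) -> None:
    print(f"router now points at {to_environment}")      # stand-in for LB/ingress update

def blue_green_release(version: str, active: str = "blue", idle: str = "green") -> None:
    # 1. Deploy the new version to the environment that is NOT serving users.
    deploy(idle, version)

    # 2. Verify the idle environment before it receives any traffic.
    for _ in range(5):
        if is_healthy(idle):
            break
        time.sleep(10)
    else:
        raise RuntimeError(f"{idle} never became healthy; aborting, {active} is untouched")

    # 3. Flip the traffic; the old environment stays around for an instant rollback.
    switch_traffic(idle)

if __name__ == "__main__":
    blue_green_release("v2.3.1")
```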

Of course, some of these tools might be too complex or expensive for our case, and we might consider building our own. Before we start building custom tools, there are a few things to consider. First of all, these tools are very complex to build. They have evolved through years in production, and the people who created them are specialists in this domain. Secondly, the majority of attempts to develop custom tools end up like this: there is a tool that only a few people know how it works, it is getting outdated, but these people are overloaded with other tasks and have no time to spend on its maintenance. New people are scared to touch the tool, due to its importance and the lack of documentation. Hence, a critical part of the lifecycle of our system depends on a few people with no capacity.

The advice for operational problems is straightforward: unless developing such tools is going to give you a competitive advantage, use the standard tools and hire some DevOps engineers. Don’t risk going out of business by trying to reinvent the wheel. This is how it works nowadays.

Learn from your mistakes

Being proactive means developing processes and practices that prevent errors from happening, but there will always be new situations that we haven’t faced before. I consider these situations valuable since we can learn from them. In such cases, we need to be reactive, embrace failure, and have a plan for when it happens.

Triage

Once an error is reported, we must be able to evaluate its importance and impact. Errors can vary in severity. They can manifest only in rare cases, they can affect specific users, or they can cause the whole service to crash and burn.

Troubleshoot

Good systems are debuggable by design. Debuggable means that any (support) engineer has the tools required to examine the health and state of the system: logs, dashboards, or debug APIs (microservice architectures should also correlate requests so that they can be traced end to end). Logs are really important when you are trying to understand a problem. We need to make sure that developers use the different log levels (ERROR, WARN, INFO, etc.) correctly and provide useful insight into what happened and why; just logging the stack trace is not enough.

Servers and frameworks expose metrics about memory/CPU, the number of requests, latency, etc. We should also provide debug dashboards or APIs for the various critical processes of our system. For example, if we have a critical job that must run periodically, we should provide a debug API that at least exposes the success/failure ratio and the progress of the running job.
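As a concrete (and hypothetical) example, such a debug API can be a tiny endpoint that exposes the counters of a scheduled job. The sketch below uses Flask and hard-coded numbers; in a real system, the job itself would keep these stats up to date.

```python
# Hypothetical debug API for a scheduled job: exposes its success/failure ratio
# and the progress of the currently running execution.
from dataclasses import dataclass
from flask import Flask, jsonify

app = Flask(__name__)

@dataclass
class JobStats:
    succeeded: int = 0
    failed: int = 0
    current_progress: float = 0.0   # fraction of the running job, 0.0-1.0

    def success_ratio(self) -> float:
        total = self.succeeded + self.failed
        return self.succeeded / total if total else 1.0

# In a real system, the job itself would update these counters.
nightly_export_stats = JobStats(succeeded=120, failed=3, current_progress=0.42)

@app.route("/debug/jobs/nightly-export")
def nightly_export_debug():
    return jsonify({
        "succeeded": nightly_export_stats.succeeded,
        "failed": nightly_export_stats.failed,
        "success_ratio": nightly_export_stats.success_ratio(),
        "current_progress": nightly_export_stats.current_progress,
    })
```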

Postmortem and history

We should strive to create a test for every error report that we encounter in production. Having an automated test suite is THE process for being proactive and ensures that the bug will not appear again (unless the suite has some sneaky flaky tests). Also, writing (postmortem) reports about the incidents that have occurred will help us understand in greater detail what the problem was and why and how it happened.
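For example, an error report can be turned into a small regression test such as the following sketch (the incident number, module, and function are hypothetical): the exact failing input from the report becomes a permanent test case that the suite runs forever.

```python
# Hypothetical regression test derived from a production incident report.
from myapp.pricing import apply_discount   # placeholder for the real module under test

def test_incident_1423_discount_on_zero_priced_item_does_not_crash():
    # Reproduces the input from the (hypothetical) incident: a zero-priced
    # promotional item used to crash the discount calculation in production.
    assert apply_discount(price=0.0, discount_percent=10) == 0.0
```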

When an issue comes up, we should do our best to find out its root cause and prevent it from happening again. Having a system that every now and then fails for some reason that we cannot understand is a nightmare.

Read “Release It!: Design and Deploy Production-Ready Software” by Michael T. Nygard, and you will get an idea of how many different things that seem innocent can bring a system down. Read engineering blogs about how other teams have dealt with their problems. Study similar systems. Learn from others’ experience!

Monitor your system

As already mentioned in the previous sections, there are a lot of tools for monitoring our systems. We can see how spiky workloads stress the infrastructure, get alerts when we reach the limits of some resource, or view aggregated application logs.

We should always keep an eye on these monitoring tools because they can show indications of future problems. For example, if the database is overloaded while the workload is normal, it might be an indication of an issue that is about to manifest.
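One way to encode such a signal, sketched below with placeholder metric-fetching functions (use whatever your monitoring stack provides, e.g. Prometheus or New Relic), is to compare the database load with the incoming request rate and alert when the ratio drifts far from its usual value.

```python
# Hypothetical check: alert when the database burns far more CPU per request
# than usual, even though the overall workload looks normal.
def fetch_db_cpu_percent() -> float:
    return 95.0    # stand-in for a real metrics query

def fetch_requests_per_second() -> float:
    return 120.0   # stand-in for a real metrics query

def check_db_load(baseline_cpu_per_rps: float = 0.25, tolerance: float = 3.0) -> None:
    cpu_per_rps = fetch_db_cpu_percent() / max(fetch_requests_per_second(), 1.0)

    # A bad query plan, lock contention, or a runaway background job would all
    # show up here before the system actually falls over.
    if cpu_per_rps > baseline_cpu_per_rps * tolerance:
        print(f"ALERT: {cpu_per_rps:.2f} CPU%/rps vs baseline {baseline_cpu_per_rps}")

if __name__ == "__main__":
    check_db_load()
```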

The data from these monitoring tools should drive the documentation of our system’s behavior. We have to know what the capacity of our system is, how it behaves under specific configurations, and what its limits and known problems are.

Conclusion

No matter what we do, there will always be problems in production. This is a truth we have to accept. The only thing we can do is educate ourselves to minimize the risk of issues and treat the system with professionalism.

“May the queries flow, and the pager stay silent.”

— Traditional SRE blessing

Further Reading

  1. Release It!: Design and Deploy Production-Ready Software (Pragmatic Programmers) by Michael T. Nygard
  2. Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy
  3. postmortem example
