For two decades, I have been shipping code to production.
I have broken systems under Black Friday load, made user authentication impossible at 2 a.m., and watched a platform serving 40 million users go down over a minor change to a configuration file.
That does not make someone a bad engineer. It makes them a realist about how software fails.
Every senior engineer I respect has a war story. What separates them from the teams living in chaos is simple: the great ones have seen failure before, so they built their systems around recovery. No dumb luck. No heroic saves.
Reliable deployments rest on three systems working together. You need monitoring sharp enough to catch slow-building problems within seconds. You need rollback strategies rehearsed well enough to trigger without blinking. And you need a recovery playbook written before you ever need it.
I will now walk you through what each of these systems looks like.
1. Monitoring: See Everything Before Users Do
Nearly every team has monitoring. Yet most teams still take 8 to 12 minutes to notice an outage after a deployment.
That gap is not a lack of tools. It is a lack of the right signals.
After two decades, I have narrowed it down to four metrics that matter for every deployment. Google calls them the Golden Signals. I call them the only things worth waking up for.
- Error rate: Not a raw count of failures, but failures as a percentage of total requests.
- P99 latency: What roughly the slowest one percent of users experience. Average latency will happily hide a disaster; P99 will not.
- Traffic: A sudden drop in request volume is as alarming as an unexpected burst. Either can mean something has gone wrong.
- Saturation: CPU, memory, connection pool headroom. How close are you to the cliff?
Set all four up as alerts and hook them into your deployment pipeline. If one of them spikes within two minutes of a push, you need to know immediately.
Below is the Prometheus alert that I use for error rate. Simple. Effective. It alerts me even before the users start complaining.
- alert: HighErrorRate
  expr: rate(http_errors[5m]) / rate(http_requests[5m]) > 0.05
  for: 2m
  annotations:
    summary: 'Error rate above 5% - check recent deploy'
A 2% threshold works during office hours, loosened to 5% overnight, adjusted to fit your traffic patterns. The exact number matters less than the fact that you get alerted at all.
The classic mistake is alerting on everything. Alert fatigue is real: page people for every blip and within a month they will stop paying attention to pages at all.
Stick to the four signals above. Make every alert mean something. Silence routine noise during the first ten minutes of deployment warm-up, then watch closely.
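The same pattern covers the latency signal. This sketch assumes your service exports a standard Prometheus histogram named `http_request_duration_seconds`; swap in your own metric name and threshold:

```yaml
- alert: HighP99Latency
  # 99th percentile over the last 5 minutes, aggregated across instances
  expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
  for: 2m
  annotations:
    summary: 'P99 latency above 500ms - check recent deploy'
```

The `for: 2m` clause keeps a single slow scrape from paging anyone; the alert only fires once the condition has held for two full minutes.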
2. The Five Rollback Strategies That Actually Work
Rollback is not a single procedure.
Teams treat it like one switch they can flip. In reality there are five distinct strategies, each best suited to a specific situation, and choosing the wrong one costs time you cannot afford.
Learn all five before your next deployment.
Strategy 1: Git Revert
The blunt instrument. Fast. Always available.
Create a new commit that reverses the change, push it, and let the pipeline redeploy.
git revert <commit-hash> --no-edit
git push origin main
Use git revert rather than git reset. Revert preserves a clear history of what changed; reset rewrites it. Never rewrite shared branch history under pressure.
Time to execute: three to four minutes, assuming a fast pipeline.
Strategy 2: Blue-Green Switch
You maintain two identical production environments. One serves traffic. One sits idle.
Deploy to the idle environment. Smoke test it. Then flip your load balancer. To roll back, flip it back. Rollback works at the speed of a configuration reload.
# Roll back with one AWS CLI command
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$BLUE_TG
Time to execute: thirty seconds.
Tradeoff: double the infrastructure cost. Worth it at scale. Evaluate for your budget.
Strategy 3: Feature Flags
The most surgical tool you have.
You do not roll back the deploy. You kill a flag. The broken code path stops executing instantly. Everything else keeps running. No pipeline. No infrastructure change.
if (flags.isEnabled('new-checkout-flow', userId)) {
  return newCheckout(cart); // kill this flag to disable
}
return legacyCheckout(cart); // always-safe fallback
Time to execute: ten seconds.
I have used this to instantly disable a broken feature for twelve million users without touching a single deployment. Wrap every high-risk code path in a flag. Do it before the deploy.
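To make the mechanism concrete, here is a minimal in-memory sketch of a flag store with a kill switch. Real systems back this with a flag service, and every name below is illustrative:

```javascript
// Minimal in-memory flag store with a kill switch. Illustrative only:
// production systems use a flag service with persistence and rollouts.
const flags = {
  store: new Map([['new-checkout-flow', true]]),
  // in a real flag service, userId would drive percentage rollouts
  isEnabled(name, userId) {
    return this.store.get(name) === true;
  },
  kill(name) {
    this.store.set(name, false); // the ten-second rollback
  },
};

function checkout(cart, userId) {
  if (flags.isEnabled('new-checkout-flow', userId)) {
    return 'new-flow';    // stand-in for newCheckout(cart)
  }
  return 'legacy-flow';   // stand-in for legacyCheckout(cart)
}

console.log(checkout({}, 42)); // new-flow
flags.kill('new-checkout-flow');
console.log(checkout({}, 42)); // legacy-flow
```

The important property is that `kill` changes behavior without a deploy: the next request simply takes the legacy path.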
Strategy 4: Canary Deployment
This one prevents disasters instead of cleaning them up.
Ship to one to five percent of traffic. Watch the metrics for fifteen minutes. If they look bad, delete the canary. If they look good, roll out to everyone.
# 1 canary pod alongside 9 stable pods = 10% traffic
kubectl scale deployment api-stable --replicas=9
kubectl scale deployment api-canary --replicas=1
Your worst case is now that a small slice of users saw an issue. Not one hundred percent.
Every team that adopts canaries wonders how they shipped without them.
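The promote-or-rollback decision can also be automated. This is a minimal sketch: the request counts would come from your metrics system, the 5% threshold is illustrative, and the commented-out kubectl commands mirror the ones above:

```shell
# Decide the canary's fate from sampled request counts.
# The 5% error threshold is illustrative; tune it to your traffic.
check_canary() {
  errors=$1   # failed requests seen by the canary
  total=$2    # total requests seen by the canary
  if [ $((errors * 100)) -gt $((total * 5)) ]; then
    echo "rollback"
    # kubectl scale deployment api-canary --replicas=0
  else
    echo "promote"
    # proceed with the full rollout
  fi
}

check_canary 3 1000    # healthy: prints "promote"
check_canary 80 1000   # failing: prints "rollback"
```

Integer arithmetic (`errors * 100 > total * 5`) avoids floating point in POSIX shell while still expressing "more than 5% errors".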
Strategy 5: Config Rollback
Sometimes the problem is not code. It is a setting.
Environment variables. Connection pool sizes. Timeout values. Rate limits. These change constantly. They break things in ways that look exactly like code bugs.
Keep your config versioned. Keep your secrets in a vault that supports versioned rollback. Know which config change shipped alongside which deploy.
Time to execute: sixty seconds.
Most underused rollback in the industry. Add it to your playbook now.
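Here is a self-contained sketch of what this looks like when config lives in its own git repo. The file name, key, and values are invented for illustration:

```shell
# Keep config in version control so a bad setting reverts like code.
workdir=$(mktemp -d) && cd "$workdir"
git init -q
git config user.email "demo@example.com"
git config user.name  "Demo"

echo "pool_size=20" > app.conf                 # known-good setting
git add app.conf && git commit -qm "baseline config"

echo "pool_size=500" > app.conf                # the risky change
git add app.conf && git commit -qm "raise pool size"

git revert --no-edit HEAD >/dev/null           # the sixty-second rollback
result=$(cat app.conf)
echo "$result"                                 # pool_size=20
```

The same revert-a-commit motion you already use for code now covers settings, and the history tells you exactly which config change shipped alongside which deploy.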
3. Failure Recovery: Write the Playbook Before You Need It
The worst time to figure out your recovery process is during an incident.
Your adrenaline is spiking. Slack is blowing up. Your CEO is DMing you. Your brain simply does not work well under those conditions. That is biology, not a personal failing.
Teams that recover within five minutes are not smarter. They prepared ahead of time.
The Incident Response Loop
Every incident moves through the same five stages. Your job is to move through them fast.
- Detect (under 2 minutes): Alert fires. On-call engineer acknowledges. Incident channel opens.
- Triage (under 7 minutes): Is this P0 or P1? How many users are affected? Is it the recent deploy?
- Mitigate (under 20 minutes): Stop the bleeding. Rollback, kill a flag, scale up. Users first.
- Resolve (under 60 minutes): Find root cause. Ship permanent fix or confirm rollback holds.
- Review (within 48 hours): Write the post-mortem. Assign action items. Close the loop.
Most teams handle the first three stages well. Then they skip the review.
The review is what prevents repeat incidents. Write it blameless, and give it concrete action items.
The Runbook You Should Write This Week
A runbook is the document a sleep-deprived engineer follows at 3 AM. It gives specific instructions for the specific failure modes of your system. I keep one for every service I own.
Here is the minimum it needs:
- Symptoms: What does the alert show? What does the dashboard look like?
- First check: One command to confirm the diagnosis without making anything worse.
- Mitigation: The fastest path to stopping user impact. Even if it is not the permanent fix.
- Escalation: Who to call and when. After thirty minutes without progress, someone else gets paged.
- Done state: What does success look like, and when exactly can you close the incident?
That last item matters more than most people realize. Without a defined done state, incidents drag on indefinitely: engineers keep debugging long after users have stopped feeling any impact.
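As a concrete starting point, a minimal runbook fits on one page. The service name, URL, and numbers below are placeholders:

```
## Runbook: payments-api (example)

Symptoms:    5xx rate above 5% on the payments dashboard; HighErrorRate firing.
First check: curl -s https://payments.internal/healthz   (placeholder URL)
Mitigation:  kill the 'new-checkout-flow' flag, or git revert the last deploy.
Escalation:  page the payments on-call lead after 30 minutes without progress.
Done state:  error rate back under 1% for 15 consecutive minutes.
```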
Game Days: Practice Before the Real Thing
Once a quarter, break something on purpose. Use staging or another non-production environment. Run your rollback process and time every step.
The first time I did this with a new team, three of our four documented rollback steps no longer worked. The infrastructure had changed underneath them, and nobody had noticed.
We found that out on a quiet Tuesday afternoon, not during a Friday night incident.
One exercise saved us from that disaster. Run them regularly and they will do the same for you.
The Bottom Line
None of this is hard to build. Prometheus takes an afternoon to set up. A git revert takes thirty seconds. A runbook takes two hours to write. A feature flag system takes a sprint.
The hard part is doing the work while the system is healthy, when nothing feels urgent. That is exactly when the most important work has to happen.
The teams with five-minute recovery times put in the effort on a calm Tuesday, long before anything was on fire.
Start with monitoring. Pick the one rollback strategy that fits your architecture and document it. Write a runbook for your most critical service.
That is enough. Those three steps alone will put you ahead of most teams I have worked with.
One of your next deploys will fail. Build the systems that let it fail without panic.
