If you've ever launched a major feature, only to watch it spectacularly break under pressure, you know exactly why resiliency matters. As a Software Development Manager (SDM) at Amazon, I've been on the front lines of creating systems that handle real-world chaos gracefully. Let's talk resiliency, focusing on three critical architectural patterns in plain language: Retries, Circuit Breakers, and Bulkheads.

Resiliency Patterns Explained

1. Retry Logic: Giving Your Code a Second Chance

Think of retry logic as calling your friend again when they don't answer the first time. It's a straightforward, powerful way to handle temporary failures such as transient network hiccups or brief outages.

Key Points:

- Exponential Backoff: Increase the wait time between retries so you don't overload a service that is already struggling.
- Limit Retries: Cap your retry attempts (usually 3–5).
- Idempotency: Ensure an operation can be retried safely, so a repeated call does not cause duplicate side effects.

Illustrative Java Example: (Disclaimer: This simplified snippet illustrates a general concept. Real-world implementations are significantly more intricate.)

    public DeliveryPromise fetchDeliveryPromiseWithRetry(Cart cart, Address address) {
        int maxRetries = 3;
        int waitTimeMs = 1000;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                DeliveryPromise promise = deliveryService.getPromise(cart, address);
                if (promise.isValid()) {
                    return promise;
                }
                if (attempt == maxRetries) {
                    break; // Out of attempts: skip the pointless final sleep
                }
                Thread.sleep(waitTimeMs);
                waitTimeMs *= 2; // Exponential backoff: 1s, then 2s
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Restore the interrupt flag
                break;
            }
        }
        // Graceful fallback
        return DeliveryPromise.unavailable("Cannot deliver to selected address.");
    }

2. Circuit Breakers: Knowing When to Stop Trying

Circuit breakers prevent systems from repeatedly trying, and failing, to call unresponsive services, much like the electrical breakers in your home that protect against overloads.

How they work:

- Monitor error rates.
- Temporarily halt calls when error thresholds are crossed.
- Gradually resume traffic after the system recovers.

*Source: Martin Fowler*
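
The three steps above can be sketched as a small state machine. This is a minimal illustration, not the implementation used in the story and not a real library's API. `SimpleCircuitBreaker`, its constructor parameters, and the injected clock are hypothetical names chosen for this example. It also simplifies in two ways: it counts consecutive failures instead of tracking an error rate, and once the cool-down elapses it lets trial calls through rather than ramping traffic up gradually.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch, not a library API: trips after N consecutive failures.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that open the breaker
    private final long openDurationMs;  // how long calls are rejected once open
    private final LongSupplier clock;   // injected time source (milliseconds)

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMs = 0;

    SimpleCircuitBreaker(int failureThreshold, long openDurationMs, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openDurationMs = openDurationMs;
        this.clock = clock;
    }

    // Runs the action when allowed; otherwise returns the fallback without calling it.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // fail fast while the breaker is open
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    synchronized State state() {
        return state;
    }

    private synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAtMs >= openDurationMs) {
            state = State.HALF_OPEN; // cool-down elapsed: let trial traffic through
        }
        return state != State.OPEN;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // a failed trial call re-opens immediately
            openedAtMs = clock.getAsLong();
        }
    }
}

class CircuitBreakerDemo {
    public static void main(String[] args) {
        long[] now = {0}; // fake clock so the demo needs no sleeping
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1000, () -> now[0]);
        Supplier<String> failing = () -> { throw new IllegalStateException("downstream down"); };
        Supplier<String> fallback = () -> "estimate unavailable";

        breaker.call(failing, fallback);
        breaker.call(failing, fallback);     // second failure trips the breaker
        System.out.println(breaker.state()); // OPEN
        now[0] = 1500;                       // cool-down has elapsed
        System.out.println(breaker.call(() -> "2-day delivery", fallback)); // 2-day delivery
        System.out.println(breaker.state()); // CLOSED
    }
}
```

Injecting the clock is a design choice for this sketch only: it keeps the demo deterministic. A production breaker would more likely use a monotonic system clock and a sliding window of recent results.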

3. Bulkheads: Containing the Damage

Bulkheads compartmentalize critical parts of a system, like the watertight compartments in a ship's hull. If one service crashes, bulkheads limit the impact, keeping your whole system afloat.

Best Practices:

- Separate critical from non-critical paths.
- Provide dedicated resources (for example, separate thread pools or connection pools) to each path so a failure in one cannot starve the others.

Real-Life Story: A Minor Bug with Major Impact

Let me illustrate this with an actual scenario I encountered at Amazon. (Note: The following is an intentionally simplified and generalized example. Our actual scenario involved complex interactions across multiple services and teams.)

The Problem:

An obscure, deprecated code path resurfaced after a routine platform migration. Customers browsing with a default delivery address would occasionally hit a rare but disastrous scenario:

1. The customer selects products based on promises calculated for their default address.
2. At checkout, they switch to a different delivery address.
3. The delivery promise is no longer valid for the newly selected address.
4. Instead of gracefully handling this edge case, our system spiraled into unnecessary retries, ending with a cryptic error message.

Imagine you're shopping for your friend's birthday gift. You trust the promise and proceed confidently, but at the last moment you hit a confusing, looping error. Even if just 1% of shoppers experience this, the cumulative impact, especially at a scale of thousands of daily orders, is enormous.
This seemingly small oversight was costing potential sales and damaging customer trust.

Technical Culprit: Misconfigured Retry Logic (Illustration)

Faulty Java Code Example (Highly Simplified):

    int retries = 5;
    while (retries-- > 0) {
        DeliveryPromise promise = legacyService.getPromise(cart, address);
        if (promise.isValid()) return promise;
        // No backoff, no break, repetitive calls
    }
    // Cryptic error message after long delay
    throw new DeliveryException("Delivery unavailable");

This loop repeated the same call up to five times with no delay between attempts and no early exit, so customers waited through every attempt before landing on an unhelpful error.

Rapid Resolution: Graceful Fallbacks and Circuit Breakers

We quickly fixed the issue by introducing proper retry logic and graceful fallbacks, supported by circuit breakers.

Updated Java Example (Simplified for Illustration):

    public DeliveryPromise safeFetchPromise(Cart cart, Address address) {
        int retries = 3;
        int waitMs = 500;
        while (retries-- > 0) {
            DeliveryPromise promise = deliveryService.getPromise(cart, address);
            if (promise.isValid()) return promise;
            if (retries == 0) break; // No attempts left: don't sleep before the fallback
            try {
                Thread.sleep(waitMs);
                waitMs *= 2; // Exponential backoff
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // Restore the interrupt flag
                break;
            }
        }
        // Immediate and clear user feedback
        return DeliveryPromise.unavailable("Sorry, this item can't be delivered to the selected address.");
    }

Circuit Breaker Integration (Conceptual Example):

Circuit breakers helped detect failing downstream services quickly and reduced wasted retries:

    // Skip the downstream call entirely while the breaker is open
    if (circuitBreaker.isOpen()) {
        return DeliveryPromise.unavailable("Delivery estimates temporarily unavailable.");
    }

Core Architectural Patterns Summarized:

- Retries: Simple way to handle temporary failures; use with caution and proper configuration.
- Circuit Breakers: Fail-fast mechanism preventing cascading failures.
- Bulkheads: Compartmentalize your architecture to limit failure scope.

Key Takeaways for Every SDM:

- Expect Failure: It's inevitable; build resilience in.
- Judiciously Implement Retries: Useful, but avoid overwhelming your systems.
- Implement Circuit Breakers Early: Essential in distributed systems to maintain stability.
- Design with Bulkheads: Prevent total system failure by isolating critical services.
- Always Prioritize User Experience: Clearly inform users about system issues and failures.
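
Bulkheads are the one pattern discussed here without a code illustration, so here is a minimal sketch of the idea. It is an assumption-laden example rather than anything from the real system: `SimpleBulkhead`, the slot counts, and the checkout and recommendations paths are hypothetical. It uses `java.util.concurrent.Semaphore` to cap concurrent calls per dependency, so a slow non-critical path cannot consume the capacity reserved for a critical one.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch, not a library API: caps concurrent calls to one dependency.
class SimpleBulkhead {
    private final Semaphore permits;

    SimpleBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the action if a slot is free; otherwise returns the fallback immediately.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // compartment is full: reject instead of queueing
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always free the slot, even if the action throws
        }
    }

    int availableSlots() {
        return permits.availablePermits();
    }
}

class BulkheadDemo {
    public static void main(String[] args) {
        // Separate compartments: recommendations can never use up checkout's slots.
        SimpleBulkhead checkout = new SimpleBulkhead(20);
        SimpleBulkhead recommendations = new SimpleBulkhead(2);

        String promise = checkout.call(() -> "Arrives Friday", () -> "Delivery estimate unavailable");
        String recs = recommendations.call(() -> "3 suggestions", () -> "");
        System.out.println(promise + " / " + recs); // Arrives Friday / 3 suggestions
    }
}
```

A semaphore is only one way to build the compartment. Dedicated thread pools or connection pools per dependency achieve the same isolation and are a common alternative when calls are dispatched asynchronously.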

Conclusion: Resiliency Is More Than Just Good Design

Building resilient systems means proactively protecting your customers' experience and your team's sanity. Resiliency isn't just technical: it's empathetic leadership, critical foresight, and practical risk management.

Disclaimer on Illustrative Examples: All code examples provided here are significantly simplified for illustrative purposes. The actual scenario encountered at Amazon involved deeply intertwined services across multiple code packages and teams. I cannot disclose the exact implementation or proprietary details, but the lessons presented here are generic enough that many will recognize and relate to these patterns and solutions.

Have you faced similar resilience challenges? I'd love to hear your stories and solutions!