There are only two hard things in Computer Science: cache invalidation and naming things - Phil Karlton
Caching is an important technique in system design and offers several benefits. With caching, you can improve the performance and availability of your system while simultaneously reducing the cost of operating your service. Caching is the Swiss army knife of system design.
In this article, I will primarily talk about availability and resiliency, which indirectly help with the other aspects of system design. For example, a service that serves its user requests mostly through a cache can reduce the number of calls made to the back-end system, which is one of the main sources of costs incurred to run the service. Similarly, as the cache can serve customer requests quickly, the service can support a higher rate of requests per second (RPS). Not only this, but if a service can serve requests through cached data, it reduces the pressure on downstream dependencies, preventing them from failing or browning out, and hence helping the service using it.
But as the saying goes, there is no free lunch in this world; caches come with their downsides as well. The performance of a cache is typically measured in the form of cache hits or misses. All things work perfectly as expected as long as you have a high cache hit ratio. The moment you start to see a relatively high ratio of cache misses, you start to see its downsides. For example, a service can serve 10k transactions per second (TPS) built on a service that can only do 1k TPS. On a good day, there is no problem, and it can withstand a failure rate of up to 10%. But on a bad day, when for some reason > 10% of requests fail to get the data from the cache, they end up making calls back to the service and risk browning it out. Now, if the service takes time to recover, the entries that are good right now will also retire and cause cache misses, leading to a further increase in traffic to the back-end service. In my last two articles about retry dilemma and backpressure, I talked about how a slight change in behavior leads to elongated outages if proper guardrails are not in place. Similarly, caches need guardrails and careful consideration.
Before going into the details of what can be done to avoid these things, let’s see some of the common cache failure modes:
Failure Modes
Cascading Failure: The Doom Loop
I call this a doom loop because, once it starts, it can cause a complete outage. Let me explain. Typically, a cache comprises multiple cache nodes to distribute the load. If one node fails, the load shifts to the other nodes while a new node comes up and starts to take traffic. Now, imagine a scenario where every node in the cache cluster is running hot, and one node fails. The traffic from the failed node spills over to the other nodes in the cluster, causing them to become overloaded and eventually crash. With more nodes out of the cluster, the load shifts to the remaining healthy nodes, potentially leading to a complete outage. These kinds of outages are not unheard of and have occurred in the real world. Even a simple deployment, where one node is removed and then re-added to the cluster, can cause this kind of outage.
Thundering Herd: Generating Excessing Back-End Load
This is another cache failure scenario that is often seen when discussing caches and their failure modes. Consider a popular website where many users access the same information, like a news article. This article is often stored in cache to speed up retrieval times. Everything works as expected until it doesn’t. Let’s say a cache entry is serving thousands of people successfully, but it reaches its expiry. Suddenly, all of these users are now sent to the database to have their queries served. This causes an overload on the backend and leads to slowdowns, errors, and even a complete crash. This is commonly known as the thundering herd problem. Increased load in the backend can lead to cascading failures for other dependent services that rely on everything behaving correctly.
Cache Coherence: Inconsistent Data in Cache
In large-scale systems, caches are often distributed across multiple nodes and levels, such as L1 (in-memory) and L2 (source database), to serve user requests. However, maintaining consistency across these nodes for a given user can be challenging due to varying access patterns and data requirements. Caching inappropriate data, such as frequently changing data, negates the benefits of caching and leads to increased overhead and cache misses. This can create unreliable behavior, where different applications using different nodes process and generate inconsistent results.
Forever Caches: Evicting and Invalidating/Expiring Cache Entries
The story of caching is incomplete without discussing cache eviction and invalidation. Removing entries from the cache is as important as adding them. Common cache eviction techniques include Least Recently Used (LRU), Least Frequently Used (LFU), and FIFO. Cache eviction typically occurs when the cache is at capacity. However, consider a scenario where the cache is full, and you remove a frequently used data point. This removal will lead to a cache miss and cause performance regression through the thundering herd effect. Another possibility is thrashing, where the same data is repeatedly added and removed due to conflicts with the selected policy, leading to cache performance degradation.
On the other hand, invalidation is necessary when data becomes stale. Common invalidation techniques are time-based or event-driven expiration.
A common failure with invalidation occurs when it is implemented improperly, leading to service outages. For example, if you haven't implemented any cache entry removal (expiration or eviction), your cache will grow in size and eventually crash the cache node. This not only increases the load on the backend but also on other cache nodes, potentially triggering the Doom Loop I explained above. People often forget this when implementing caches. I have seen production systems operate for years before this behavior was discovered, leading to multi-hour outages. As someone aptly said, "Everything happens for a reason."
While I am sure these are not the only cache failure modes, I want to focus on these for now. Let's discuss some practical strategies to resolve these failure modes and make caches useful for high-volume production use cases.
Practical Approaches
Intermediatory Caches: L2 Caching Layer
In the case of caches, two is better than one. The idea is to introduce an intermediate layer between the in-memory and the back-end database. This middle cache is highly available and decoupled from the back-end database. If a cache miss occurs, it first checks this middle layer (let's call it L2), and only if the data is not present there does it hit L3 (the backend). This removes the direct coupling between the application and back-end databases and simplifies deployment; when cache node eviction occurs, it falls back to L2 instead of the backend itself. Because this is separate from the main application, it can have a dedicated setup and more relaxed memory constraints for better performance.
However, as you can now see, you have a new caching fleet to maintain. This increases costs, and the system now has a new mode of operation to deal with. All the best practices mentioned in the previous articles should be considered when dealing with failures here.
Request Coalescing
In the previous section, I discussed introducing an L2 caching layer. While helpful, it can still suffer from a thundering herd problem and needs safeguards to prevent overload. The application using the cache can also help mitigate this. Application code can employ a technique called request coalescing, where multiple similar requests are combined to reduce pressure on the backend. In this technique, only one request among the similar ones is made, and the data returned is shared with all the common calls.
There are multiple ways to implement request coalescing. One popular approach is to introduce a queue where requests are staged, and a single processor thread makes the back-end call and distributes the response to all the waiting requests. Locking is another widely adopted technique.
Sensible Caching
The success of a caching solution is measured by its cache hit rate. To increase the chances of success, you can pre-warm the cache with frequently accessed data or data that has a high read-to-write ratio. This reduces the load on the backend and is commonly referred to as a solution to the cold cache problem. Another useful technique is negative caching, where you cache data that is not present in the back-end database and update the application code to handle it. In this case, when a query is made to get data for a particular record that is not present in the cache or the backend, it can be served with a "no data" response without querying the backend. Additionally, well-researched caching strategies such as "write-through cache," "write-back cache," and "read-through cache" should be considered based on the application's needs.
Proper Cache Eviction and Invalidation/Expiration
Core to a performant cache is how eviction and record invalidation are handled. Eviction is necessary to maintain the right cache size. The Eviction Policy decides which items are removed from the cache when it reaches capacity. For example, if temporal locality is important, use LRU; if you have a consistent access pattern, use LFU.
Invalidation/Expiration determines how long to keep a cache in memory. The most common technique to set expiration is through absolute time-based expiration (setting time to live (TTL)). Setting the appropriate TTL is important to avoid mass eviction or eviction of a hotkey, leading to a stampede effect. You can jitter the TTL to avoid concurrent eviction. Generally, TTL is set based on the staleness sensitivity of the application relying on it.
While preparation is essential, nothing compares to real-world data. Start with reasonable settings, but then rely on metrics and alarms to measure cache performance and tune it to production needs.
Caches Will Fail: It’s a Matter of When, Not If
Instead of only preparing to deal with cache failure after it happens, let’s bake failure mitigation into the design itself. You can start with the assumption that cache will eventually fail, and your system should be able to deal with it. Let’s take the Thundering Herd scenario explained above as an example. To prepare for it, I suggest developers introduce a ~10% (or as they see appropriate) failure rate in cache query requests and observe how their system behaves. It could be a canary that, at regular intervals, causes this to occur and forces the system to exercise this failure mode. This helps highlight both Thundering Herd and Doom Loop failure modes. This is often overlooked and is the most powerful tool in the developer toolset. Monitoring caches: Analyze cache performance on a regular basis and change strategies to make it more dynamic.
Conclusion
Caching can bring much-needed performance and scalability to distributed systems, but it requires critical thinking and strategic planning. Understanding the potential pitfalls of caching will enable a developer to optimize their systems effectively while reducing risks. By implementing strategic caching, strong invalidation, consistency mechanisms, and comprehensive monitoring, developers can create truly efficient and resilient distributed systems.
