Caching plays an important role in designing low-latency, high-throughput systems. When you process a request, you want to avoid I/O operations as much as possible, because they take time: disk seeks are expensive, and network calls are expensive. Typical times (Source):
Send 2K bytes over 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Read 1 MB sequentially from disk: 20,000,000 ns
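To get a feel for the gaps between these numbers, here is a small sketch that computes the ratios. It uses only the three figures quoted above; the dictionary keys and the `slowdown` helper are names made up for illustration.

```python
# Latency figures quoted above, in nanoseconds.
LATENCY_NS = {
    "send_2k_over_1gbps": 20_000,
    "read_1mb_memory": 250_000,
    "read_1mb_disk": 20_000_000,
}


def slowdown(slow: str, fast: str) -> float:
    """Return how many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


disk_vs_memory = slowdown("read_1mb_disk", "read_1mb_memory")
print(f"Reading 1 MB from disk is {disk_vs_memory:.0f}x slower than from memory")
```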
One should always keep the scale of data in mind while caching. My thought process when thinking about caching is as follows.
Caching can cause inconsistencies, so it should be avoided as much as possible. First check whether the system meets its non-functional requirements (NFRs), that is latency and throughput, without caching. If it does, don't add a cache to the system. If it doesn't, check whether the requirements are correct and whether such stringent NFRs are really needed. For example:
(a) If someone says you need to fetch data for 30 items in under 10 ms, that may not be realistic. If the latency target is 100 ms, it may be reasonable.
(b) Many times, calls are made in parallel to various systems. Consider a system which calls services A and B, then computes its output by merging their results. Suppose someone says the latency for A should be 100 ms, whereas for B it should be 200 ms, and A already responds in 150 ms without a cache. You should try to get the reasoning behind the 100 ms number (you should always question numbers which come out of a magic hat). Since the calls run in parallel, the overall response cannot be faster than the slower call, so making A faster than B buys nothing. If there is no reasoning, the requirement for A should be changed to 200 ms and caching should not be done.
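The arithmetic behind example (b) can be sketched as follows. This assumes the 150 ms figure is the uncached latency of A and that A and B are called fully in parallel; the function name is made up for illustration.

```python
def combined_latency_ms(parallel_latencies_ms):
    """Calls made in parallel finish when the slowest one finishes."""
    return max(parallel_latencies_ms)


# Assumed numbers from example (b): B is allowed 200 ms.
without_cache = combined_latency_ms([150, 200])    # A uncached at 150 ms
with_cache_on_a = combined_latency_ms([100, 200])  # A cached down to 100 ms

print(without_cache, with_cache_on_a)
```

Both cases come out the same, because the response is bounded by B. Under these assumptions, caching A adds inconsistency risk without improving the overall latency.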
One safe caching approach is to use a central cache store. It makes sure that all your app instances read the same value (for simplicity, let's ignore replication delay inside the cache store). It is also easy to update a value, as you only need to update it in the central cache. But depending on the kind and size of the data we need to cache, we can take a few different approaches.
(a) A few MBs of immutable or slow-changing data
For this case, one may consider caching in app memory. Either preload the data at run time or load it lazily on first read. For immutable data, writes need not be propagated. It also means one less component to maintain. To pick up updates, entries may be expired after some time so that new data can be loaded. For example, say you want to cache the names of countries, the PIN codes in a country, or the list of players playing in a particular match.
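A minimal sketch of in-app caching with lazy loading and time-based expiry, assuming a single-threaded app. `TTLCache`, `load_countries` and the 60-second TTL are hypothetical names and values chosen for illustration; a real service would also need locking for concurrent access.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-process cache: loads a value on first read, reloads it after the TTL."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader      # fetches the value from the main store
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be simulated
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                     # fresh hit
        value = self._loader(key)               # miss or expired: reload
        self._entries[key] = (now + self._ttl, value)
        return value


# Hypothetical loader standing in for a read from the main data store.
load_calls = []

def load_countries(key: str):
    load_calls.append(key)
    return ["India", "Japan", "Kenya"]


fake_now = [0.0]  # simulated clock, in seconds
cache = TTLCache(load_countries, ttl_seconds=60, clock=lambda: fake_now[0])
cache.get("countries")   # first read: lazy load
cache.get("countries")   # served from memory
fake_now[0] = 61.0
cache.get("countries")   # entry expired: reloaded
print(len(load_calls))
```

The loader is called only on the first read and again after expiry, so the main store sees two reads for three requests here.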
(b) Hundreds of MBs of immutable or slow-changing data
For slow-changing data of this size, you should not keep the data in app memory and should probably move it to a cache data store. But you could keep that store on the same box as the service, i.e. have the application and the cache data store on the same machine. It reduces network latency at the cost of stale data and increased complexity of writes (as you have to write to the cache on every machine). It may be worth exploring if you need to fetch large amounts of data while serving requests. One strategy to propagate writes to the cache is to have listeners on the main store which update the cache on all app machines. I would pick this approach only if a central cache store doesn't meet my requirements.
One more trade-off is that whenever you add a new instance, its cache needs to be warmed up. So it may take time for a new box to perform efficiently.
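The listener strategy and the warm-up trade-off can be sketched in a single process as below. `MainStore` and `LocalCache` are hypothetical stand-ins; in a real deployment the notification would travel over the network to each machine, which is where the extra write complexity comes from.

```python
from typing import Any, Callable, Dict, List


class MainStore:
    """Stand-in for the main data store; notifies listeners after each write."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._listeners: List[Callable[[str, Any], None]] = []

    def subscribe(self, listener: Callable[[str, Any], None]) -> None:
        self._listeners.append(listener)

    def write(self, key: str, value: Any) -> None:
        self._data[key] = value
        for listener in self._listeners:   # push the change to every machine
            listener(key, value)

    def read(self, key: str) -> Any:
        return self._data[key]


class LocalCache:
    """Stand-in for the cache store co-located with one app instance."""

    def __init__(self, store: MainStore) -> None:
        self._store = store
        self._entries: Dict[str, Any] = {}
        store.subscribe(self._on_write)

    def _on_write(self, key: str, value: Any) -> None:
        self._entries[key] = value

    def get(self, key: str) -> Any:
        if key not in self._entries:       # cold entry: go to the main store
            self._entries[key] = self._store.read(key)
        return self._entries[key]

    def cached_keys(self) -> List[str]:
        return list(self._entries)


store = MainStore()
store.write("match:1:players", ["Asha", "Ravi"])
cache_1 = LocalCache(store)
cache_2 = LocalCache(store)
cache_1.get("match:1:players")                              # warms cache_1 only
store.write("match:1:players", ["Asha", "Ravi", "Meena"])   # reaches both caches
cache_3 = LocalCache(store)                                 # new instance starts cold
print(cache_2.get("match:1:players"), cache_3.cached_keys())
```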
(c) Data in GBs
You should definitely go with a central cache. If not, the data size may restrict the types of machines you can use. If you are using a cloud service, you may have to go for more expensive machines. Cache warm-up time will also increase. A larger data size also means more entities: even if the rate of change of each individual entity is low, cumulatively the rate of change may be significant.
(d) Data in different data centres
One cache data store per data centre may be useful in this case. If the cache is in only one data centre, the network latency from the other data centres will be high and may take away the benefit of caching (as the time to read from the main data store may be less than the network latency).
With caches you usually have to think about eviction policies. If you are running out of space, you should evict entries using size-based policies. If data gets stale after a certain time, use time-based policies. You may not have to evict data at all if you have enough memory and the data doesn't get stale.
Generally, LRU (least recently used) and LFU (least frequently used) strategies are used to select the entries which need to be removed.
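As an illustration of size-based eviction, here is a minimal LRU sketch built on Python's `collections.OrderedDict`. The `LRUCache` class and the capacity of 2 are made up for this example, and it is not thread-safe.

```python
from collections import OrderedDict
from typing import Any, Hashable


class LRUCache:
    """Size-bounded cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)         # mark as most recently used
        return self._entries[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop the least recently used

    def __contains__(self, key: Hashable) -> bool:
        return key in self._entries


lru = LRUCache(capacity=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")       # "a" is now more recently used than "b"
lru.put("c", 3)    # over capacity: "b" is evicted
print("a" in lru, "b" in lru, "c" in lru)
```

An LFU cache would instead track how often each entry is read and evict the one with the lowest count.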
Two caching problems I have come across:
(a) DB load at 7 pm every day
At 7 pm every day, the load on the DB increased. This was because a time-based cache policy was in use, and the expiry coincided with the time the site saw its highest traffic. Moving the cache reload to early morning, when the site had less traffic, solved the issue.
Learning: If you are using time-based eviction, make sure it happens at a time when the site has less traffic.
(b) Users unable to update profile
One case I heard of was people caching the user data shown on the profile page. When users updated their info and went to the profile page again, they saw the old data. They tried again, with the same result. Though the data was updated in the DB, users were not able to see the updated result because the cached copy was still being served.
Learning: Be careful with updates and with where you cache data.
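One common way to avoid the stale profile problem is to invalidate the cached entry whenever the data is written. The sketch below assumes all reads and writes go through the same service; `ProfileService` and its in-memory dictionaries are hypothetical stand-ins for the real DB and cache.

```python
from typing import Any, Dict


class ProfileService:
    """Reads go through the cache; every write invalidates the cached entry."""

    def __init__(self) -> None:
        self._db: Dict[str, Dict[str, Any]] = {}      # stand-in for the database
        self._cache: Dict[str, Dict[str, Any]] = {}   # stand-in for the cache

    def get_profile(self, user_id: str) -> Dict[str, Any]:
        if user_id not in self._cache:
            # Cache miss: copy the row so cache and DB do not share a dict.
            self._cache[user_id] = dict(self._db[user_id])
        return self._cache[user_id]

    def update_profile(self, user_id: str, **fields: Any) -> None:
        self._db.setdefault(user_id, {}).update(fields)
        self._cache.pop(user_id, None)   # invalidate so the next read is fresh


service = ProfileService()
service.update_profile("u1", name="Asha", city="Pune")
before = service.get_profile("u1")["city"]   # cached from here on
service.update_profile("u1", city="Mumbai")
after = service.get_profile("u1")["city"]    # reloaded after invalidation
print(before, after)
```

Without the `pop` in `update_profile`, the second read would still return the old city, which is the behaviour described in case (b).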
P.S.: This is a mix of things which I have learnt over time. Do let me know if I have missed any reference, and I will add it.