I was going through one of my favorite engineering newsletters, The Pragmatic Engineer by Gergely Orosz, and ran across one of his older posts: a case study on Shopify making their app 20% faster by caching on the backend. Shopify built their app on top of React Native and noticed that the home page was becoming slow to load because the Rails backend made multiple database queries, driving up latency. "The solution to this was using write-through caching, reducing the database load by 15% and overall app latency by about 20%."

Link to Article

Understanding the problem

To re-state the problem: "The main screen of the Shop app is the most used feature. Serving the feed is complex as it requires aggregating orders from millions of Shopify and non-Shopify merchants in addition to handling tracking data from dozens of carriers. Due to a variety of factors, loading and merging this data is both computationally expensive and quite slow."

Instead of just reading and regurgitating this issue, I decided to replicate the problem on a smaller scale to get a better understanding of caching data using the cache-aside pattern, and of the effects that different caching patterns, optimization strategies and invalidation rules have on latency, along with the trade-offs that emerge within my limited use case.

My Case Study

For my case study I built a basic React Native and Rails app, a very simplistic version of the Shopify case study. In the backend I set up CRUD endpoints enabling the creation, updating, deletion and reading of data. I integrated highlight.io for monitoring and logging. I chose this tool specifically because, having used it in previous projects, I liked the session replay and performance tracing functionality, which enables more detailed monitoring. Session Replay in particular lets me visually replay the issues a user runs into, making it much easier to reproduce bugs.
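For context, here is a minimal sketch of what the backend routes could look like. This is an assumption on my part rather than code from the repository: it maps the standard CRUD actions to an `ItemsController` and includes the `/write_items` path referenced in the load-test script shown later.

```ruby
# config/routes.rb -- a sketch only; the actual routes file may differ.
Rails.application.routes.draw do
  # Standard CRUD endpoints served by ItemsController (index, create, update, destroy).
  resources :items, only: [:index, :create, :update, :destroy]

  # The load-test script also references a /write_items endpoint; here it is
  # assumed to map to the same create action.
  post 'write_items', to: 'items#create'
end
```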
Stress Testing

I implemented a script to simulate a relatively high load, executing batches of read, write and update operations. Because of the limitations of my machine, I simulated 1,000 individual actions/requests of each type being made to the application consecutively, repeated for 10 minutes. Simply put, the script completes as many of these requests as it can within a 10-minute window. This is not even half the volume of the Shopify case study, but I deemed it an adequate starting point for my MVP case study.

```ruby
namespace :high_load do
  desc 'Run high load simulation for reads, writes, and updates'
  task :simulate => :environment do
    ...
    read_endpoint = URI.parse('http://127.0.0.1:3000/items')
    write_endpoint = URI.parse('http://127.0.0.1:3000/write_items')
    ...
    iterations = 1000
    max_retries = 2
    end_time = Time.now + 10 * 60
    default_memcached_memory_limit_mb = 64
    cache_capacity = default_memcached_memory_limit_mb * 1024 * 1024
    cache_size = 125.89 * 1024 * 1024

    while Time.now < end_time
      remaining_time = (end_time - Time.now).to_i
      minutes_left = remaining_time / 60
      puts "Time left: #{minutes_left} minutes"

      # Read requests
      iterations.times do
        retry_count = 0
        success = false
        combined_latency = 0
        until success || retry_count >= max_retries
          begin
            combined_latency = Benchmark.realtime do
              http = Net::HTTP.new(read_endpoint.host, read_endpoint.port)
              http.read_timeout = 120
              response = http.get(read_endpoint.request_uri)
              if response.code == '200'
                items = JSON.parse(response.body)
                success = true
                cache_hits += 1
              else
                read_error_count += 1
                cache_misses += 1
              end
            end
          rescue Net::ReadTimeout
            retry_count += 1
            puts "Read request timed out. Retrying (Attempt #{retry_count}/#{max_retries})..."
            sleep(2)
          end
        end
        if !success
        else
          Rails.cache.delete('items/all')
          cache_invalidation_count += 1
          combined_latencies << combined_latency
        end
      end

      # Write requests
      iterations.times do
        retry_count = 0
        success = false
        combined_latency = 0
        until success || retry_count >= max_retries
          begin
            combined_latency = Benchmark.realtime do
              http = Net::HTTP.new(write_endpoint.host, write_endpoint.port)
              http.read_timeout = 60
              item = Item.new(
                name: Faker::Commerce.product_name,
                description: Faker::Lorem.sentence,
                price: Faker::Commerce.price(range: 0..100.0)
              )
              if item.save
                puts "Item created: #{item.name}"
                success = true
                cache_evictions += 1
              else
                write_error_count += 1
              end
            end
          rescue Net::ReadTimeout
            retry_count += 1
            sleep(2)
          end
        end
        if !success
        else
          Rails.cache.delete('items/all')
          cache_invalidation_count += 1
          combined_latencies << combined_latency
        end
      end

      # Update requests
      iterations.times do
        retry_count = 0
        success = false
        combined_latency = 0
        until success || retry_count >= max_retries
          begin
            combined_latency = Benchmark.realtime do
              random_item = Item.offset(rand(Item.count)).first
              if random_item.update(
                name: Faker::Commerce.product_name,
                description: Faker::Lorem.sentence,
                price: Faker::Commerce.price(range: 0..100.0)
              )
                puts "Item updated: #{random_item.name}"
                success = true
                cache_evictions += 1
              else
                update_error_count += 1
                puts "Error updating item: #{random_item.errors.full_messages.join(', ')}"
              end
            end
          rescue Net::ReadTimeout
            retry_count += 1
            puts "Update request timed out. Retrying (Attempt #{retry_count}/#{max_retries})..."
            sleep(2)
          end
        end
        if !success
          puts "Update request failed after #{max_retries} attempts. Skipping..."
        else
          Rails.cache.delete('items/all')
          cache_invalidation_count += 1
          combined_latencies << combined_latency
        end
      end

      # Sample a raw database query time on each pass through the loop.
      query_time = Benchmark.realtime do
        Item.first
      end
      database_query_times << query_time

      sleep 1
    end
  end
end
```

I settled on a set of metrics I believed would best help me understand the behavior of different cache patterns and invalidation rules in my case study. I specifically opted to monitor the following.

Average Combined Latency: This is the average latency of all actions carried out in the simulation. I monitored it in two ways: through highlight.io, the monitoring tool, and using the `Benchmark.realtime` method. Benchmark is a module that "provides methods to measure and report the time used to execute Ruby code". Latency ultimately measures the time it takes for an application to respond to a request, with high response times indicating that the application is experiencing latency.
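The script collects each successful request's `Benchmark.realtime` sample in `combined_latencies`. The exact reporting code isn't shown above, but the average could be computed along these lines (a sketch, not the project's actual summary code):

```ruby
# Assumed reporting step: average the per-request Benchmark.realtime samples.
if combined_latencies.any?
  average_combined_latency = combined_latencies.sum / combined_latencies.size
  puts "Average combined latency: #{(average_combined_latency * 1000).round(2)} ms"
end
```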
Average Database Query Time: I chose to measure the time it takes to perform a database query to better understand the performance of my database operations. The Shopify blog post mentioned looking for simple solutions like "updating our database schema and adding more database indexes". Further research indicated that many factors can slow down database queries, including, but not limited to, an un-optimized database schema and improper indexing. Since latency can be thought of as an end-to-end measure of the request lifecycle, database querying is a possible bottleneck that could drive up latency.

Cache Hit Rate: This measures the percentage of requests served from the cache. Cache hit rate, what does this even mean, right? Enter GPT. After probing GPT multiple times, the best definition I got was: "Cache hit rate is a measure of the effectiveness of a cache. The cache hit rate is the percentage of requests for data that can be served by the cache, rather than having to be retrieved from the origin server. This directly impacts application performance and scalability. A high cache hit rate indicates that most requests are served quickly from the cache, reducing database load and improving response times." I calculated this by tracking cache hits and total requests, and dividing cache hits by total requests.

```ruby
cache_hit_rate = (cache_hits.to_f / total_requests) * 100
```

Cache Miss Rate: This measures "the number of total cache misses divided by the total number of memory requests made to the cache." According to Redis, a cache miss refers to a state where data is requested by a component and not found in the cache. Since the cache is designed to expedite data retrieval compared to database queries, when a cache miss occurs the system has to fetch the requested data from the slower backing store, which can become a performance bottleneck. I measured this by tracking cache misses and total requests, and dividing cache misses by total requests.

```ruby
cache_miss_rate = (cache_misses.to_f / total_requests) * 100
```

Cache Eviction Rate: This measures how frequently items are removed from the cache. "Eviction is when a cache write causes the aggregate data in a cache to exceed the available memory storage, and the cache must remove some data to make room." This can happen as a result of cache size limits or time-to-live expiration, amongst other factors. Why is this important? A high eviction rate could indicate that my cache is too small or that my time-to-live settings have not been adequately tuned, resulting in more frequent cache misses and affecting overall performance. I chose to break evictions down by cause to get a more granular understanding of why they were occurring, allowing for more accurate optimization.

```ruby
cache_eviction_rate = (cache_evictions.to_f / total_requests) * 100
cache_evictions_due_to_size = cache_evictions * (cache_capacity / cache_size)
cache_evictions_due_to_expiration = (cache_invalidation_count * (cache_capacity / cache_size)) - cache_evictions_due_to_size
```

Cache Invalidation Frequency: "Cache invalidation is a process that ensures the data stored in a cache remains current and consistent with the original data source. This process is necessary to maintain data accuracy, system integrity and optimal performance, as outdated or incorrect cache data can lead to errors and inefficiencies." This metric is important as it helps me understand how often data in my cache becomes stale or outdated. It matters for data consistency, since a high cache invalidation frequency could indicate that my data is frequently becoming outdated, potentially leading to greater inconsistency. It also has a direct impact on the cache hit rate, as more frequent invalidation means more data needs to be re-fetched from the backend.
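The formulas above cover the hit, miss and eviction rates but not invalidation frequency. Following the same pattern, it could be derived from the `cache_invalidation_count` the script already tracks; this is an assumed formula mirroring the others, not one taken from the project:

```ruby
# Assumed formula, mirroring the rate calculations above:
# invalidations as a percentage of all requests made during the run.
cache_invalidation_frequency = (cache_invalidation_count.to_f / total_requests) * 100
```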
Cache patterns

As a first option I opted to implement a cache-aside pattern, the most common caching pattern. According to the AWS white paper on caching patterns, the fundamental data retrieval logic can be summarized as follows:

1. When your application needs to read data from the database, it checks the cache first to determine whether the data is available.
2. If the data is available (a cache hit), the cached data is returned, and the response is issued to the caller.
3. If the data isn't available (a cache miss), the database is queried for the data. The cache is then populated with the data that is retrieved from the database, and the data is returned to the caller.

This approach ensures that the cache only contains data the application actually requests, making it very cost effective. It is also a straightforward approach and, as the saying goes, K.I.S.S.

Implementation:

```ruby
class ItemsController < ApplicationController
  def index
    cache_key = "items/all"
    items = Rails.cache.fetch(cache_key, expires_in: 5.minutes) do
      Item.all.to_a
    end
    render json: items
  end

  def create
    item = Item.new(item_params)
    if item.save
      Rails.cache.delete("items/all")
      render json: item, status: :created
    else
      render json: item.errors, status: :unprocessable_entity
    end
  end

  def update
    item = Item.find(params[:id])
    if item.update(item_params)
      Rails.cache.delete("items/all")
      render json: item
    else
      render json: item.errors, status: :unprocessable_entity
    end
  end

  def destroy
    item = Item.find(params[:id])
    if item.destroy
      Rails.cache.delete("items/all")
      head :no_content
    else
      render json: { error: "Item could not be deleted" }, status: :unprocessable_entity
    end
  end

  private

  def item_params
    params.require(:item).permit(:name, :description)
  end
end
```

Results:

Pre-cache implementation

Post-cache implementation

Results from highlight.io

Improvement in Latency: From implementing cache-aside, we already see an improvement in latency. highlight.io specifically reports p90, p50 and average latency; p90 signifies that 90% of requests completed within the given latency value, while the remaining 10% took longer. Without a baseline for comparison, the other cache metrics don't tell us much at this point.

GitHub commit: https://github.com/EmmS21/shopifycasestudy/commit/fe719b7e46e4800f82ba7074b91c4ffaf4e03ef6

Optimization: Granular caching strategy and Database indexing

To further optimize my cache, I asked myself what the simplest optimization strategy was given the caching pattern I had already implemented. Database indexing came to mind. Although database query time didn't appear to be causing any major bottlenecks, I wanted to understand what effect indexing would have on this metric.

What is database indexing? "An index maps search keys to corresponding data on disk by using different in-memory and on-disk data structures. Index is used to quicken the search by reducing the number of records to search for." In other words, an index makes it easy to locate data in a database by one or multiple columns. This, however, has its drawbacks: if, for example, I insert or update data, the index itself also needs to be updated. It is therefore quite possible that indexing would actually slow down database operations in my case, since my simulation is not only carrying out read requests but also insertions and updates.
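As an illustration, an index like the one discussed could be added with a standard Rails migration. The class name, migration version and the choice of the `category` column are assumptions based on the granular-caching controller shown below, not code from the repository:

```ruby
# db/migrate/xxxxxxxxxxxxxx_add_index_to_items_on_category.rb (hypothetical file name)
class AddIndexToItemsOnCategory < ActiveRecord::Migration[7.0]
  def change
    # Speeds up lookups such as Item.where(category: ...), at the cost of
    # extra index maintenance on every insert and update.
    add_index :items, :category
  end
end
```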
More granular caching

I was initially using one generic cache key, `items/all`, for all tasks related to reading, updating or inserting data in my cache. I opted to implement a more granular approach, creating cache keys scoped to each request (in this implementation, scoped by the item category in the request). I reasoned that with unique cache keys associated with each request, I could invalidate entries based on their key, possibly leading to more efficient cache invalidation and reduced memory waste. If I update an item, for example, only the portion of the cache under that key is invalidated and refreshed, as opposed to the entire cache.

Implementation

```ruby
class ItemsController < ApplicationController
  before_action :set_cache_key, only: [:index]
  before_action :set_item_cache_key, only: [:create, :update, :destroy]

  def index
    items = Rails.cache.fetch(@cache_key, expires_in: 5.minutes) do
      Item.where(category: params[:category]).to_a
    end
    render json: items
  end

  def create
    item = Item.new(item_params)
    if item.save
      Rails.cache.delete(@item_cache_key)
      render json: item, status: :created
    else
      render json: item.errors, status: :unprocessable_entity
    end
  end

  def update
    item = Item.find(params[:id])
    if item.update(item_params)
      Rails.cache.delete(@item_cache_key)
      render json: item
    else
      render json: item.errors, status: :unprocessable_entity
    end
  end

  def destroy
    item = Item.find(params[:id])
    if item.destroy
      Rails.cache.delete(@item_cache_key)
      head :no_content
    else
      render json: { error: "Item could not be deleted" }, status: :unprocessable_entity
    end
  end

  private

  def item_params
    params.require(:item).permit(:name, :description, :category)
  end

  def set_cache_key
    @cache_key = "items/#{params[:category]}/all"
  end

  def set_item_cache_key
    @item_cache_key = "items/#{params[:category]}/all"
  end
end
```

Results:

Git commit: https://github.com/EmmS21/shopifycasestudy/commit/4fb1270baaaf6a4589080e9e0b1dc3dcc335f1a7

No changes? Before delving into the other steps I carried out, I realized at this point that these optimization techniques would likely not yield any tangible results, as I was probably not simulating a high enough load to actually start observing the benefits of caching and of the optimizations I implemented. Stay tuned for round two of this case study.