Shopify Case Study: Understanding Caching Patterns Using a Simple MVP

Written by emmanuels | Published 2024/02/06
Tech Story Tags: software-engineering | caching | memcached | rails | react-native | caching-patterns | backend-caching | mvp-caching

TL;DR: A case study running a small simulation to observe the effects of cache-aside caching on latency, and the effects of additional optimizations.

I was going through one of my favorite engineering newsletters, The Pragmatic Engineer by Gergely Orosz, and ran across one of his older posts on Shopify making their app 20% faster by caching on the backend.

Case Study

Shopify "built their app on top of React Native and noticed how the home page was starting to become slow on loading. This was because the Rails backend made multiple database queries, reducing latency". The solution to this was using write-through caching reducing the database load by 15% and overall app latency by about 20%.

Link to Article


Understanding the problem

To re-state the problem: "The main screen of the Shop app is the most used feature. Serving the feed is complex as it requires aggregating orders from millions of Shopify and non-Shopify merchants in addition to handling tracking data from dozens of carriers. Due to a variety of factors, loading and merging this data is both computationally expensive and quite slow."

Instead of just reading and regurgitating this issue, I decided to replicate the problem on a smaller scale to better understand caching data using the cache-aside pattern, what effects different caching patterns, optimization strategies and invalidation rules have on latency, and what trade-offs emerge within my limited use case.

My Case Study

For my case study I built a basic React Native and Rails app, a very simplistic version of the Shopify setup. In the backend I set up CRUD endpoints enabling the creation, updating, deletion and reading of data. I integrated highlight.io for monitoring and logging. I chose this tool specifically because, having used it in previous projects, I liked the session replay and performance tracing functionality, which enables more detailed monitoring. Session replay in particular lets me visually replay what a user was doing when they hit a usability issue, making bugs much easier to reproduce.


Stress Testing

I implemented a script to simulate a relatively high load of reads, writes and updates. Because of the limitations of my machine, I simulated batches of 1000 individual actions/requests made to the application back to back, repeated until a 10 minute window elapsed. Simply put, the script completes as many of these requests as it can over a 10 minute period. This is nowhere near the volume of the Shopify case study, but I deemed it an adequate starting point for my MVP case study.

namespace :high_load do
    desc 'Run high load simulation for reads, writes, and updates'
    task :simulate => :environment do  
      ...  
      read_endpoint = URI.parse('http://127.0.0.1:3000/items')
      write_endpoint = URI.parse('http://127.0.0.1:3000/write_items')  
      ...  
      iterations = 1000
      max_retries = 2  
      end_time = Time.now + 10 * 60
      default_memcached_memory_limit_mb = 64
      cache_capacity = default_memcached_memory_limit_mb * 1024 * 1024
      cache_size = 125.89 * 1024 * 1024   # 125.89 MB (estimated size of the cached data) in bytes
      while Time.now < end_time  
        remaining_time = (end_time - Time.now).to_i
        minutes_left = remaining_time / 60
        puts "Time left: #{minutes_left} minutes"  
        iterations.times do
          retry_count = 0
          success = false
          combined_latency = 0
          until success || retry_count >= max_retries
            begin
              combined_latency = Benchmark.realtime do 
                http = Net::HTTP.new(read_endpoint.host, read_endpoint.port)
                http.read_timeout = 120   
                response = http.get(read_endpoint.request_uri)
                if response.code == '200'
                  items = JSON.parse(response.body)
                  success = true
                  cache_hits += 1    # approximation: any successful read counts as a hit
                else
                  read_error_count += 1
                  cache_misses += 1  # approximation: any failed read counts as a miss
                end
              end
            rescue Net::ReadTimeout
              retry_count += 1
              puts "Read request timed out. Retrying (Attempt #{retry_count}/#{max_retries})..."
              sleep(2) 
            end
          end
          if success
            Rails.cache.delete('items/all')
            cache_invalidation_count += 1
            combined_latencies << combined_latency
          end
        end 
        iterations.times do
          retry_count = 0
          success = false
          combined_latency = 0
          until success || retry_count >= max_retries
            begin
              combined_latency = Benchmark.realtime do
                # NOTE: the write path saves records via ActiveRecord directly;
                # this HTTP client is set up but never used to hit write_endpoint
                http = Net::HTTP.new(write_endpoint.host, write_endpoint.port)
                http.read_timeout = 60
                item = Item.new(
                  name: Faker::Commerce.product_name,
                  description: Faker::Lorem.sentence,
                  price: Faker::Commerce.price(range: 0..100.0)
                )
                if item.save
                  puts "Item created: #{item.name}"
                  success = true
                  cache_evictions += 1  # approximation: each write is assumed to evict an entry
                else
                  write_error_count += 1
                end
              end
            rescue Net::ReadTimeout
              retry_count += 1
              sleep(2) 
            end
          end
          if success
            Rails.cache.delete('items/all')
            cache_invalidation_count += 1
            combined_latencies << combined_latency
          end
        end
        iterations.times do
          retry_count = 0
          success = false
          combined_latency = 0
          until success || retry_count >= max_retries
            begin
              combined_latency = Benchmark.realtime do
                # NOTE: updates also use ActiveRecord directly, so no HTTP request
                # is made here and the Net::ReadTimeout rescue below never fires
                random_item = Item.offset(rand(Item.count)).first
                if random_item.update(
                  name: Faker::Commerce.product_name,
                  description: Faker::Lorem.sentence,
                  price: Faker::Commerce.price(range: 0..100.0)
                )
                  puts "Item updated: #{random_item.name}"
                  success = true
                  cache_evictions += 1
                else
                  update_error_count += 1
                  puts "Error updating item: #{random_item.errors.full_messages.join(', ')}"
                end
              end
            rescue Net::ReadTimeout
              retry_count += 1
              puts "Update request timed out. Retrying (Attempt #{retry_count}/#{max_retries})..."
              sleep(2) 
            end
          end
          if !success
            puts "Update request failed after #{max_retries} attempts. Skipping..."
          else
            Rails.cache.delete('items/all') 
            cache_invalidation_count += 1 
            combined_latencies << combined_latency
          end
        end
        query_time = Benchmark.realtime do
          Item.first
        end
        database_query_times << query_time
        sleep 1
      end
      # metric calculations (shown below) and summary output elided
    end
end
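
Assuming the task file lives under `lib/tasks/` in the Rails app, the simulation is kicked off as a standard rake task:

    bundle exec rake high_load:simulate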

I set estimates for metrics I believed would best help me understand the behavior of different cache patterns and invalidation rules in my case study. I specifically opted to monitor:

  • Average Combined Latency: This is the average latency of all actions carried out in the simulation. I monitored this through two methods: the monitoring tool highlight.io, and the `Benchmark.realtime` method. Benchmark is a Ruby module that 'provides methods to measure and report the time used to execute Ruby code'. Latency ultimately measures the time it takes for an application to respond to a request, with high response times indicating an application experiencing latency.

  • Average Database Query Time: I chose to measure the time it takes to perform a database query to better understand the performance of my database operations. The Shopify blog post mentioned looking for simple solutions like 'updating our database schema and adding more database indexes'. Further research indicated that many factors can slow down database queries, including, but not limited to, an un-optimized database schema and improper indexing. Since latency can be thought of as an end-to-end measure of the request lifecycle, database querying is a possible bottleneck that could drive up latency.

  • Cache Hit Rate: Cache hit rate 'is a measure of the effectiveness of a cache. The cache hit rate is the percentage of requests for data that can be served by the cache, rather than having to be retrieved from the origin server'. What does this even mean, right? Enter GPT. After probing GPT multiple times, the best definition I got was: 'the percentage of requests served from the cache. This directly impacts application performance and scalability. A high cache hit rate indicates that most requests are served quickly from the cache, reducing database load and improving response times.' I calculated this by tracking cache hits and total requests, and dividing cache hits by total requests.

cache_hit_rate = (cache_hits.to_f / total_requests) * 100

  • Cache Miss Rate: This measures "the number of total cache misses divided by the total number of memory requests made to the cache". According to Redis, a cache miss refers to a state where data is requested by a component and not found in the cache. Since the cache is designed to expedite data retrieval over database queries, when a cache miss occurs the system fetches the requested data from the underlying data store (in my case, the database). This can lead to a performance bottleneck. I measured this by tracking cache misses and total requests, and dividing cache misses by total requests.

    cache_miss_rate = (cache_misses.to_f / total_requests) * 100

  • Cache Eviction Rate: "Eviction is when a cache write causes the aggregate data in a cache to exceed the available memory storage, and the cache must remove some data to make room". This measures how frequently items are removed from the cache, which could happen as a result of cache size limits or time-to-live (TTL) expiration, amongst other factors. Why is this important? A high eviction rate could be an indicator that my cache is too small, or that my TTL settings have not been adequately optimized, resulting in more frequent cache misses and affecting overall performance.

  • I chose to break evictions down by cause, to get a more granular understanding of why they were occurring, allowing for more accurate optimization (a worked example follows the estimates below):

    # rough proportional estimates, not exact counters: the capacity/size ratio
    # apportions observed evictions between size pressure and expiration
    cache_eviction_rate = (cache_evictions.to_f / total_requests) * 100
    cache_evictions_due_to_size = cache_evictions * (cache_capacity / cache_size)
    cache_evictions_due_to_expiration = (cache_invalidation_count * (cache_capacity / cache_size)) - cache_evictions_due_to_size
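
To make these estimates concrete, here is a quick worked example using the constants from the simulation script; the eviction and invalidation counts are made-up numbers purely for illustration:

    cache_capacity = 64 * 1024 * 1024     # memcached's default 64 MB limit
    cache_size = 125.89 * 1024 * 1024     # estimated size of the cached data
    ratio = cache_capacity / cache_size   # => ~0.508

    cache_evictions = 2000                # hypothetical counter values
    cache_invalidation_count = 2500
    cache_evictions_due_to_size = cache_evictions * ratio  # ~1017
    cache_evictions_due_to_expiration = (cache_invalidation_count * ratio) - cache_evictions_due_to_size  # ~254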


Cache patterns

As a first option, I implemented a cache-aside pattern. This is the most common caching pattern. According to the AWS White Paper on `Caching Patterns`:

The fundamental data retrieval logic can be summarized as follows: 1. When your application needs to read data from the database, it checks the cache first to determine whether the data is available. 2. If the data is available (a cache hit), the cached data is returned, and the response is issued to the caller. 3. If the data isn't available (a cache miss), the database is queried for the data. The cache is then populated with the data that is retrieved from the database, and the data is returned to the caller.

This approach ensures that the cache only contains data the application actually requests, making it very cost effective. It is also a straightforward approach, and as the saying goes: K.I.S.S.


Implementation:

class ItemsController < ApplicationController
  def index
    # Cache-aside read: check the cache first; on a miss, run the block to
    # query the database, populate the cache, and return the data
    cache_key = "items/all"
    items = Rails.cache.fetch(cache_key, expires_in: 5.minutes) do
      Item.all.to_a
    end
    render json: items
  end

  def create
    item = Item.new(item_params)
    if item.save
      # invalidate the cached list so the next read repopulates it
      Rails.cache.delete("items/all")
      render json: item, status: :created
    else
      render json: item.errors, status: :unprocessable_entity
    end
  end

  def update
    item = Item.find(params[:id])
    if item.update(item_params)
      Rails.cache.delete("items/all")
      render json: item
    else
      render json: item.errors, status: :unprocessable_entity
    end
  end

  def destroy
    item = Item.find(params[:id])
    if item.destroy
      Rails.cache.delete("items/all")
      head :no_content
    else
      render json: { error: "Item could not be deleted" }, status: :unprocessable_entity
    end
  end

  private

  def item_params
    params.require(:item).permit(:name, :description)
  end
end
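
For completeness, here is a hypothetical routes file matching the endpoints the load script hits; the post doesn't show the actual routes, so the `write_items` mapping is an assumption:

Rails.application.routes.draw do
  resources :items                          # GET /items serves the (cached) list
  post '/write_items', to: 'items#create'  # assumed mapping for the script's write endpoint
end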

Results:

Pre-cache implementation:

Post-cache implementation:


Results from highlight.io:

Improvement in Latency

From implementing cache-aside, we already see an improvement in latency. highlight.io specifically reports p90, p50 and average latency. P90 signifies that 90% of requests completed within the given latency value, while the remaining 10% took longer. Without a baseline to compare against, the other cache metrics don't tell us much at this point.
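
To make the percentile definitions concrete, here is a minimal sketch of how p50/p90 could be computed from the `combined_latencies` array the script collects. highlight.io does this for us; the `percentile` helper below is my own, using a simple nearest-rank approximation:

def percentile(values, pct)
  sorted = values.sort
  sorted[((pct / 100.0) * (sorted.length - 1)).round]
end

p50 = percentile(combined_latencies, 50)  # half of requests completed within this time
p90 = percentile(combined_latencies, 90)  # 90% of requests completed within this time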

GitHub commit:

https://github.com/EmmS21/shopifycasestudy/commit/fe719b7e46e4800f82ba7074b91c4ffaf4e03ef6


Optimization: Granular caching strategy and Database indexing

To further optimize my cache, I asked myself what the simplest optimization strategy was given the caching pattern I had already implemented. Database indexing came to mind. Although database query time didn't appear to be causing any major bottlenecks, I wanted to understand what effect indexing would have on this metric.


What is Database indexing: "an index maps search keys to corresponding data on disk by using different in-memory and on-disk data structures. Index is used to quicken the search by reducing the number of records to search for"

This means implementing an index that can be used to quickly locate data in a database by one or multiple columns. This, however, has its drawbacks: if I were to insert or update data, the index itself would also need to be updated. It is therefore very possible that indexing would actually slow down database operations in my instance, since my simulation is not only carrying out reads, but also inserts and updates. A sketch of the migration follows.
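
A minimal sketch of such a migration, assuming reads filter on a `category` column (the same column the granular cache keys below are built from); the migration name and Rails version are illustrative:

class AddIndexToItemsCategory < ActiveRecord::Migration[7.0]
  def change
    # speeds up reads that filter by category, at the cost of extra index
    # maintenance on every insert and update that touches the column
    add_index :items, :category
  end
end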

More granular caching

I was initially using one generic cache key, `items/all`, for any and all tasks related to reading, updating or inserting data in my cache. I opted for a more granular approach, creating keys per category of request. I reasoned that with this approach I would be creating unique cache keys associated with each unique action, enabling me to access entries in my cache by key and possibly leading to more efficient cache invalidation and reduced memory waste. If I were to update an item, for example, only the portion of the cache whose key is associated with that update would be invalidated and refreshed, as opposed to the entire cache.


Implementation

class ItemsController < ApplicationController
  before_action :set_cache_key, only: [:index]
  before_action :set_item_cache_key, only: [:create, :update, :destroy]

  def index
    # cache-aside read, now scoped to a per-category key
    items = Rails.cache.fetch(@cache_key, expires_in: 5.minutes) do
      Item.where(category: params[:category]).to_a
    end
    render json: items
  end

  def create
    item = Item.new(item_params)
    if item.save
      # only the affected category's cache entry is invalidated
      Rails.cache.delete(@item_cache_key)
      render json: item, status: :created
    else
      render json: item.errors, status: :unprocessable_entity
    end
  end

  def update
    item = Item.find(params[:id])
    if item.update(item_params)
      Rails.cache.delete(@item_cache_key)
      render json: item
    else
      render json: item.errors, status: :unprocessable_entity
    end
  end

  def destroy
    item = Item.find(params[:id])
    if item.destroy
      Rails.cache.delete(@item_cache_key)
      head :no_content
    else
      render json: { error: "Item could not be deleted" }, status: :unprocessable_entity
    end
  end

  private

  def item_params
    params.require(:item).permit(:name, :description, :category)
  end

  # both helpers currently resolve to the same per-category key, so reads and
  # writes for a category share one cache entry
  def set_cache_key
    @cache_key = "items/#{params[:category]}/all"
  end

  def set_item_cache_key
    @item_cache_key = "items/#{params[:category]}/all"
  end
end
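
As an illustration of the keys this produces, assuming `category` arrives as a top-level request param (which is what the `set_*_cache_key` helpers read); the category values are hypothetical:

# GET  /items?category=books        -> reads/populates key "items/books/all"
# POST /write_items?category=books  -> deletes key "items/books/all" only;
#                                      "items/toys/all" etc. stay warm
# Note: requests with no category param all share the key "items//all"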


Results:

Git Commit: https://github.com/EmmS21/shopifycasestudy/commit/4fb1270baaaf6a4589080e9e0b1dc3dcc335f1a7

No changes?

Before delving into the other steps I carried out, I realized at this point that these optimization techniques would likely not yield any tangible results, as I was probably not simulating a high enough load to actually start observing the benefits of caching and the optimizations I implemented. Stay tuned for round two of this case study.


Written by emmanuels | Software Engineering | Data | Entrepreneurship
Published by HackerNoon on 2024/02/06