I was going through one of my favorite engineering newsletters, The Pragmatic Engineer by Gergely Orosz, and ran across one of his older posts on Shopify making their app 20% faster by caching on the backend.
Case Study
Shopify "built their app on top of React Native and noticed how the home page was starting to become slow on loading. This was because the Rails backend made multiple database queries, reducing latency". The solution to this was using write-through caching reducing the database load by 15% and overall app latency by about 20%.
Understanding the problem
To restate the problem: "The main screen of the Shop app is the most used feature. Serving the feed is complex as it requires aggregating orders from millions of Shopify and non-Shopify merchants in addition to handling tracking data from dozens of carriers. Due to a variety of factors, loading and merging this data is both computationally expensive and quite slow."
Instead of just reading and regurgitating this issue, I decided to replicate the problem on a smaller scale to better understand caching data with the cache-aside pattern, and to see what effects different caching patterns, optimization strategies and invalidation rules have on latency, along with the trade-offs that emerge within my limited use case.
My Case Study
For my case study I built a basic React Native and Rails app, a very simplistic version of the Shopify setup. In the backend I set up a CRUD endpoint enabling the creation, updating, deletion and reading of data. I integrated highlight.io for monitoring and logging. I chose this tool specifically because, having used it in previous projects, I liked the session replay and performance tracing functionality, which enables more detailed monitoring. Session replay in particular lets me visually replay what a user did when they hit a usability issue, making bugs much easier to reproduce.
Stress Testing
I implemented a script to simulate a relatively high load of reads, writes and updates. Because of the limitations of my machine, I simulated 1000 individual actions/requests of each type made to the application consecutively, repeated over a 10 minute window. Simply put, the script completes as many of these requests as it can in 10 minutes. This is nowhere near the volume of the Shopify case study, but I deemed it an adequate starting point for my MVP case study.
namespace :high_load do
desc 'Run high load simulation for reads, writes, and updates'
task :simulate => :environment do
...
read_endpoint = URI.parse('http://127.0.0.1:3000/items')
write_endpoint = URI.parse('http://127.0.0.1:3000/write_items')
...
iterations = 1000
max_retries = 2
end_time = Time.now + 10 * 60
default_memcached_memory_limit_mb = 64
cache_capacity = default_memcached_memory_limit_mb * 1024 * 1024
cache_size = 125.89 * 1024 * 1024
while Time.now < end_time
remaining_time = (end_time - Time.now).to_i
minutes_left = remaining_time / 60
puts "Time left: #{minutes_left} minutes"
iterations.times do
retry_count = 0
success = false
combined_latency = 0
until success || retry_count >= max_retries
begin
combined_latency = Benchmark.realtime do
http = Net::HTTP.new(read_endpoint.host, read_endpoint.port)
http.read_timeout = 120
response = http.get(read_endpoint.request_uri)
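# NOTE: any successful (200) read is counted as a cache hit here;
# misses are only recorded for non-200 responses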
if response.code == '200'
items = JSON.parse(response.body)
success = true
cache_hits += 1
else
read_error_count += 1
cache_misses += 1
end
end
rescue Net::ReadTimeout
retry_count += 1
puts "Read request timed out. Retrying (Attempt #{retry_count}/#{max_retries})..."
sleep(2)
end
end
if !success
puts "Read request failed after #{max_retries} attempts. Skipping..."
else
Rails.cache.delete('items/all')
cache_invalidation_count += 1
combined_latencies << combined_latency
end
end
iterations.times do
retry_count = 0
success = false
combined_latency = 0
until success || retry_count >= max_retries
begin
combined_latency = Benchmark.realtime do
http = Net::HTTP.new(write_endpoint.host, write_endpoint.port)
http.read_timeout = 60
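# NOTE: these writes (and the updates below) go directly through ActiveRecord
# rather than through the HTTP write endpoint defined above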
item = Item.new(
name: Faker::Commerce.product_name,
description: Faker::Lorem.sentence,
price: Faker::Commerce.price(range: 0..100.0)
)
if item.save
puts "Item created: #{item.name}"
success = true
cache_evictions += 1
else
write_error_count += 1
end
end
rescue Net::ReadTimeout
retry_count += 1
sleep(2)
end
end
if !success
puts "Write request failed after #{max_retries} attempts. Skipping..."
else
Rails.cache.delete('items/all')
cache_invalidation_count += 1
combined_latencies << combined_latency
end
end
iterations.times do
retry_count = 0
success = false
combined_latency = 0
until success || retry_count >= max_retries
begin
combined_latency = Benchmark.realtime do
random_item = Item.offset(rand(Item.count)).first
if random_item.update(
name: Faker::Commerce.product_name,
description: Faker::Lorem.sentence,
price: Faker::Commerce.price(range: 0..100.0)
)
puts "Item updated: #{random_item.name}"
success = true
cache_evictions += 1
else
update_error_count += 1
puts "Error updating item: #{random_item.errors.full_messages.join(', ')}"
end
end
rescue Net::ReadTimeout
retry_count += 1
puts "Update request timed out. Retrying (Attempt #{retry_count}/#{max_retries})..."
sleep(2)
end
end
if !success
puts "Update request failed after #{max_retries} attempts. Skipping..."
else
Rails.cache.delete('items/all')
cache_invalidation_count += 1
combined_latencies << combined_latency
end
end
query_time = Benchmark.realtime do
Item.first
end
database_query_times << query_time
sleep 1
end
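With the task defined in the high_load namespace, the simulation is kicked off from the project root (assuming a standard Rails setup) with:
bin/rails high_load:simulate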
I set estimates for the metrics I believed would best help me understand the behavior of different cache patterns and invalidation rules in my case study, and specifically opted to monitor:
Average Combined Latency: This is the average latency of all actions carried out in the simulation. I monitored this in two ways: through the monitoring tool, highlight.io, and using the `Benchmark.realtime` method. Benchmark is a Ruby module that 'provides methods to measure and report the time used to execute Ruby code'. Latency ultimately measures the time it takes for an application to respond to a request, with high response times indicating an application experiencing latency.
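As a minimal illustration of how these numbers are produced (the endpoint URL is taken from the script above), `Benchmark.realtime` simply returns the wall-clock time, in seconds, taken to execute the block:
require 'benchmark'
require 'net/http'

# Measure how long a single read request takes, in seconds
latency = Benchmark.realtime do
  Net::HTTP.get_response(URI.parse('http://127.0.0.1:3000/items'))
end
puts "Request took #{(latency * 1000).round(1)} ms"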
Average Database Query Time: I chose to measure the time it takes to perform a database query to better understand the performance of my database operations. The Shopify blog post mentioned looking for simple solutions like 'updating our database schema and adding more database indexes'. Further research indicated that many factors can slow down database queries, including, but not limited to, an un-optimized database schema and improper indexing. Since latency can be thought of as an end-to-end measure of the request lifecycle, database querying is a possible bottleneck that could drive up latency.
Cache Hit Rate: Cache hit rate 'is a measure of the effectiveness of a cache. The cache hit rate is the percentage of requests for data that can be served by the cache, rather than having to be retrieved from the origin server'. What does this even mean, right? Enter GPT. After probing GPT multiple times, the best definition I got was: 'the percentage of requests served from the cache. This directly impacts application performance and scalability. A high cache hit rate indicates that most requests are served quickly from the cache, reducing database load and improving response times.' I calculated this by tracking cache hits and total requests, and dividing cache hits by total requests.
cache_hit_rate = (cache_hits.to_f / total_requests) * 100
Cache Miss Rate: This measures 'the number of total cache misses divided by the total number of memory requests made to the cache'. According to Redis, a cache miss refers to a state where data is requested by a component and not found in the cache. Since the cache is designed to expedite data retrieval compared to database queries, when a cache miss occurs the system falls back to the slower underlying data store, which can lead to a performance bottleneck. I measured this by tracking cache misses and total requests, and dividing cache misses by total requests.
cache_miss_rate = (cache_misses.to_f / total_requests) * 100
Cache Eviction Rate: "Eviction is when a cache write causes the aggregate data in a cache to exceed the available memory storage, and the cache must remove some data to make room." This measures how frequently items are removed from the cache, which can happen as a result of cache size limits or time-to-live (TTL) expiration, among other factors. Why is this important? A high eviction rate could be an indicator that my cache is too small or my TTL settings have not been adequately tuned, resulting in more frequent cache misses and affecting overall performance.
I chose to break this down by cause to get a more granular understanding of why evictions were occurring, allowing for more accurate optimization:
cache_eviction_rate = (cache_evictions.to_f / total_requests) * 100
cache_evictions_due_to_size = cache_evictions * (cache_capacity / cache_size)
cache_evictions_due_to_expiration = (cache_invalidation_count * (cache_capacity / cache_size)) - cache_evictions_due_to_size
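These are derived estimates. As a sanity check, since the script assumes a memcached-backed `Rails.cache` (see the `default_memcached_memory_limit_mb` constant), the raw hit, miss and eviction counters can also be read straight from memcached. A minimal sketch, assuming the default `mem_cache_store`:
# Rails.cache.stats returns memcached's own counters, keyed by server
Rails.cache.stats.each do |server, stats|
  hits   = stats['get_hits'].to_f
  misses = stats['get_misses'].to_f
  puts "#{server}: hit rate #{((hits / (hits + misses)) * 100).round(2)}%, evictions #{stats['evictions']}"
end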
Cache patterns
As a first pass I opted to implement the cache-aside pattern, the most common caching pattern. According to the AWS whitepaper on `Caching Patterns`:
The fundamental data retrieval logic can be summarized as follows:
1. When your application needs to read data from the database, it checks the cache first to determine whether the data is available.
2. If the data is available (a cache hit), the cached data is returned, and the response is issued to the caller.
3. If the data isn't available (a cache miss), the database is queried for the data. The cache is then populated with the data that is retrieved from the database, and the data is returned to the caller.
This approach ensures that the cache only contains data the application actually requests, making it very cost effective. It is also a straightforward approach, and as the saying goes: K.I.S.S.
Implementation:
class ItemsController < ApplicationController
def index
cache_key = "items/all"
items = Rails.cache.fetch(cache_key, expires_in: 5.minutes) do
Item.all.to_a
end
render json: items
end
def create
item = Item.new(item_params)
if item.save
Rails.cache.delete("items/all")
render json: item, status: :created
else
render json: item.errors, status: :unprocessable_entity
end
end
def update
item = Item.find(params[:id])
if item.update(item_params)
Rails.cache.delete("items/all")
render json: item
else
render json: item.errors, status: :unprocessable_entity
end
end
def destroy
item = Item.find(params[:id])
if item.destroy
Rails.cache.delete("items/all")
head :no_content
else
render json: { error: "Item could not be deleted" }, status: :unprocessable_entity
end
end
private
def item_params
params.require(:item).permit(:name, :description)
end
end
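One caveat worth flagging: with a default Rails setup, `Rails.cache` is a null store in development unless caching is toggled on, so none of this has any effect until a store is configured. A minimal sketch, assuming memcached via the dalli gem:
# config/environments/development.rb (enable with `bin/rails dev:cache`)
config.cache_store = :mem_cache_store, 'localhost:11211'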
Results:
Pre-cache implementation:
Post-cache implementation:
Results from highlight.io:
From implementing cache-aside, we already see an improvement in latency. highlight.io specifically reports p90, p50 and average latency. p90 signifies that 90% of requests completed within the given latency value, while the remaining 10% took longer. Without a baseline to compare against, the other cache metrics don't tell us much at this point.
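To compare against highlight.io's numbers, the same percentiles can be computed locally from the `combined_latencies` array the rake task already collects. A rough sketch (nearest-rank percentile; the helper is my own, not part of the script):
# Nearest-rank percentile over latencies collected in seconds
def percentile(values, pct)
  sorted = values.sort
  sorted[((pct / 100.0) * (sorted.length - 1)).round]
end

p50 = percentile(combined_latencies, 50)
p90 = percentile(combined_latencies, 90)
puts "p50: #{(p50 * 1000).round(1)} ms, p90: #{(p90 * 1000).round(1)} ms"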
GitHub commit:
https://github.com/EmmS21/shopifycasestudy/commit/fe719b7e46e4800f82ba7074b91c4ffaf4e03ef6
Optimization: Granular caching strategy and Database indexing
To further optimize my cache, I asked myself what the simplest optimization strategy was, given the caching pattern I had already implemented. Database indexing came to mind. Although database query time didn't appear to be causing any major bottlenecks, I wanted to understand what effect indexing would have on this metric.
What is database indexing? "An index maps search keys to corresponding data on disk by using different in-memory and on-disk data structures. Index is used to quicken the search by reducing the number of records to search for."
In practice, this means creating an index that can be used to quickly locate data in a database by one or multiple columns. This, however, has its drawbacks: if I insert or update data, the index itself also needs to be updated. It is therefore very possible that indexing would actually slow down database query times in my case, since my simulation is not only carrying out reads but also inserts and updates.
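For reference, adding an index in Rails is a one-line migration. Since the more granular cache keys below are scoped by category, that column is the natural candidate (the exact column and migration version here are my assumptions):
# db/migrate/xxxxxxxx_add_index_to_items_category.rb
class AddIndexToItemsCategory < ActiveRecord::Migration[7.0]
  def change
    # Speeds up Item.where(category: ...) lookups, at the cost of extra
    # work on every insert/update that touches this column
    add_index :items, :category
  end
end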
More granular caching.
I was initially using one generic cache key, `items/all`, for all tasks related to reading, updating or inserting data in my cache. I opted for a more granular approach, creating keys for each request. I reasoned that this way I would be creating unique cache keys associated with each unique action, letting me access entries in my cache by key and possibly leading to more efficient cache invalidation and less wasted memory. If I were to update my data, for example, only the portion of the cache associated with that key would be invalidated and refreshed, as opposed to the entire cache.
Implementation
class ItemsController < ApplicationController
before_action :set_cache_key, only: [:index]
before_action :set_item_cache_key, only: [:create, :update, :destroy]
def index
items = Rails.cache.fetch(@cache_key, expires_in: 5.minutes) do
Item.where(category: params[:category]).to_a
end
render json: items
end
def create
item = Item.new(item_params)
if item.save
Rails.cache.delete(@item_cache_key)
render json: item, status: :created
else
render json: item.errors, status: :unprocessable_entity
end
end
def update
item = Item.find(params[:id])
if item.update(item_params)
Rails.cache.delete(@item_cache_key)
render json: item
else
render json: item.errors, status: :unprocessable_entity
end
end
def destroy
item = Item.find(params[:id])
if item.destroy
Rails.cache.delete(@item_cache_key)
head :no_content
else
render json: { error: "Item could not be deleted" }, status: :unprocessable_entity
end
end
private
def item_params
params.require(:item).permit(:name, :description, :category)
end
def set_cache_key
@cache_key = "items/#{params[:category]}/all"
end
def set_item_cache_key
@item_cache_key = "items/#{params[:category]}/all"
end
end
Results:
Git Commit: https://github.com/EmmS21/shopifycasestudy/commit/4fb1270baaaf6a4589080e9e0b1dc3dcc335f1a7
No changes?
Before delving into the other steps I carried out, I realized at this point that the optimization techniques would likely not yield any tangible results, as I was probably not simulating a high enough load to actually observe the benefits of caching and the optimizations I implemented. Stay tuned for round two of this case study.