Designing an API Rate Limiter

Internet-scale distributed systems use an API rate limiter to protect availability and security by capping how many requests a client can make in a given period. Let’s break down the design components of this system.

Requirements

Throttle requests exceeding the rate limit
Distributed rate limiter

Data storage

Database schema

A NoSQL document store such as MongoDB stores the rate-limiting rules and the throttle data

Type of data store

A NoSQL document store such as MongoDB stores the rate-limiting rules and the throttle data
A cache server such as Redis keeps the rate-limiting rules and the throttle data in memory for faster lookups
The message queue stores the dropped requests for analytics and auditing purposes
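
The article does not spell out the schema, so the following is only a minimal sketch of what a rule document and a throttle record might look like. Every field name here (endpoint, limit, window_seconds, window_start, count) is a hypothetical choice for illustration, not a prescribed layout.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RateLimitRule:
    """One rate-limiting rule, stored as a document and cached in memory."""
    endpoint: str        # API endpoint the rule applies to (hypothetical field)
    limit: int           # maximum requests allowed per window
    window_seconds: int  # length of the window in seconds


@dataclass
class ThrottleRecord:
    """Per-client usage for one rule (the throttle data)."""
    client_id: str       # cookie, user ID, or IP address
    endpoint: str
    window_start: int    # Unix timestamp at which the current window began
    count: int = 0       # requests seen in the current window


rule = RateLimitRule(endpoint="/v1/orders", limit=100, window_seconds=60)
record = ThrottleRecord(client_id="user-42", endpoint=rule.endpoint,
                        window_start=1_700_000_000)
rule_document = asdict(rule)  # dict form, as it might be written to a document store
```

Keeping rules and usage in separate records reflects the split described above: rules change rarely and are read on every request, while throttle data changes on every request.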

High-level design

The cookie, user ID, or IP address is used to identify the client
The rate limiter drops the request and returns the status code “429 Too Many Requests” to the client if the throttling threshold is exceeded
The rate limiter delegates the request to the API server if the throttling threshold is not exceeded
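
The three steps above can be sketched as a small decision function. This is an illustration under assumptions: the header name X-User-Id, the lookup order (user ID, then cookie, then IP address), and the plain in-memory counter are all hypothetical, and returning 200 merely stands in for delegating to the API server.

```python
from http import HTTPStatus


def identify_client(headers: dict, remote_ip: str) -> str:
    """Pick a client key: user ID first, then cookie, then IP address."""
    if "X-User-Id" in headers:  # hypothetical header name
        return "user:" + headers["X-User-Id"]
    if "Cookie" in headers:
        return "cookie:" + headers["Cookie"]
    return "ip:" + remote_ip


def handle(client_key: str, counts: dict, limit: int) -> int:
    """Return 429 when the client is over the limit, else 200."""
    counts[client_key] = counts.get(client_key, 0) + 1
    if counts[client_key] > limit:
        return HTTPStatus.TOO_MANY_REQUESTS.value  # drop the request
    return HTTPStatus.OK.value  # stands in for delegating to the API server
```

The counter here never resets; the algorithms in the next section decide when and how usage is forgotten.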

Throttling type and algorithms

Soft throttle: fixed window, token bucket
Hard throttle: sliding window, sliding window with counter, leaky bucket
Dynamic throttle: all of the above with an additional system query to check for free resources
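
To make two of the algorithms named above concrete, here is a minimal single-process sketch of a token bucket and a sliding window with counter. The class names and parameters are assumptions for illustration; a production limiter would keep this state in the shared cache rather than in local objects. Time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Token bucket: tokens refill at a steady rate; each request spends one."""

    def __init__(self, capacity: int, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class SlidingWindowCounter:
    """Sliding window with counter: weight the previous window's count by overlap."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_index = 0   # index of the fixed window being counted
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window)
        if index != self.current_index:
            # Roll forward; the old count only matters if it is the adjacent window.
            adjacent = index == self.current_index + 1
            self.previous_count = self.current_count if adjacent else 0
            self.current_index = index
            self.current_count = 0
        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - (now % self.window) / self.window
        estimated = self.previous_count * overlap + self.current_count
        if estimated + 1 > self.limit:
            return False
        self.current_count += 1
        return True
```

The token bucket tolerates short bursts up to its capacity, while the sliding window with counter smooths out the boundary spike that a plain fixed window allows, at the cost of an approximate count.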

Workflow

The client creates an HTTP connection to the web server
The web server forwards the request to the rate limiter service
The rate limiter service queries the rules cache to check the rate limit rule for the requested API endpoint
On a cache miss, the read replicas of the rules NoSQL data store are queried to identify the rate limit rule
A distributed lock such as a Redis lock handles concurrency when the same user makes multiple requests at the same time in a distributed system
The rate limiter service queries the throttle cache to verify if the throttle threshold is exceeded
The throttle cache follows a write-behind (write-back) cache pattern: throttle data is written to the cache and queued on the message queue for asynchronous persistence, which keeps request latency low
The throttle sync service executes a batch operation to move the throttle data from the message queue to the NoSQL document store
The NoSQL document store persists the throttle data for fault tolerance, long-living rate limits, analytics, and auditing
The dropped requests (throttle threshold exceeded) are stored on the message queue for analytics and auditing
If the client request does not exceed the throttle limit, the rate limiter delegates the client request to the API server
LRU cache eviction is used for cache servers
HTTP response headers report the relevant throttle limit data to the client, such as the limit and the remaining quota
Consistent hashing is used to redirect the request from a user to the same subset of servers
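
The core of the workflow above can be sketched in one process, with plain Python objects standing in for each component. Everything here is an assumption made for illustration: the dictionaries replace Redis and the NoSQL store, the deques replace the message queue, threading.Lock replaces a distributed lock, and a fixed window is used only because it is the shortest algorithm to show. The X-RateLimit-* header names follow a common convention rather than anything the article specifies; Retry-After is a standard HTTP header.

```python
import threading
from collections import deque

# Stand-in for the rules NoSQL read replica.
rules_store = {"/v1/orders": {"limit": 2, "window_seconds": 60}}
rules_cache = {}          # stand-in for the rules cache
throttle_cache = {}       # (client, endpoint, window index) -> request count
throttle_queue = deque()  # write-behind buffer, drained later by a sync job
dropped_queue = deque()   # dropped requests kept for analytics and auditing
lock = threading.Lock()   # stand-in for a distributed lock


def get_rule(endpoint: str) -> dict:
    """Cache-aside lookup: try the rules cache, fall back to the store on a miss."""
    rule = rules_cache.get(endpoint)
    if rule is None:
        rule = rules_store[endpoint]
        rules_cache[endpoint] = rule
    return rule


def check(client_id: str, endpoint: str, now: int) -> tuple:
    """Return (status_code, response_headers) for one request."""
    rule = get_rule(endpoint)
    window = now // rule["window_seconds"]
    key = (client_id, endpoint, window)
    with lock:  # serialize the read-modify-write on the counter
        count = throttle_cache.get(key, 0)
        allowed = count < rule["limit"]
        if allowed:
            count += 1
            throttle_cache[key] = count
            throttle_queue.append((key, count))  # persisted asynchronously
    headers = {
        "X-RateLimit-Limit": str(rule["limit"]),
        "X-RateLimit-Remaining": str(rule["limit"] - count),
    }
    if allowed:
        return 200, headers  # stands in for delegating to the API server
    # Seconds until the current fixed window ends.
    remaining_seconds = rule["window_seconds"] - now % rule["window_seconds"]
    headers["Retry-After"] = str(remaining_seconds)
    dropped_queue.append((client_id, endpoint, now))
    return 429, headers
```

Because the request path only touches the in-memory cache and appends to a queue, the slower write to the document store stays off the critical path, which is the point of the write-behind pattern.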

References

Paul Tarjan, Scaling your API with rate limiters, stripe.com
Rate-limiting strategies and techniques, cloud.google.com
