Managing request flow effectively in distributed systems is crucial to maintaining stability, reliability, and performance. If unchecked, incoming traffic can overwhelm services, leading to degraded performance or complete failure. Developing a robust throttling mechanism ensures that requests are handled gracefully, balancing availability with system protection.
This is part 2 of a 3-part series. Read part 1 here.
Implementing throttling mechanisms helps prevent overuse of system resources and ensures fair distribution of access. Let’s examine different throttling strategies from the infrastructure to the application level.
Load balancers act as the first line of defense by regulating the flow of traffic across multiple backend servers. They identify abnormal traffic patterns—such as a spike in requests from a particular IP—and limit or block access before the backend is impacted. Integrated firewalls further protect the system from Distributed Denial of Service (DDoS) attacks and prevent resource exhaustion during peak periods.
An API gateway sits between clients and backend services and applies fine-grained control over incoming requests. API gateways can enforce rate limits based on API keys or user roles, ensuring traffic remains within acceptable thresholds. This approach not only manages usage but also provides insight into traffic patterns through monitoring and logging.
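To make the gateway-level idea concrete, here is a minimal sketch of per-API-key rate limiting using a fixed time window. The class name ApiKeyThrottle, the window size, and the limits are illustrative assumptions, not any particular gateway's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ApiKeyThrottle {
    private final int limitPerWindow;
    private final long windowMillis;
    private final Map<String, Window> windows = new ConcurrentHashMap<>();

    // Tracks the start time and request count of the current window for one key
    private static final class Window {
        final long start;
        final AtomicInteger count = new AtomicInteger();
        Window(long start) { this.start = start; }
    }

    public ApiKeyThrottle(int limitPerWindow, long windowMillis) {
        this.limitPerWindow = limitPerWindow;
        this.windowMillis = windowMillis;
    }

    // Admit the request if this key has not exhausted its quota in the current window
    public boolean allow(String apiKey, long nowMillis) {
        Window w = windows.compute(apiKey, (k, existing) ->
            existing == null || nowMillis - existing.start >= windowMillis
                ? new Window(nowMillis)   // window expired: start a fresh one
                : existing);
        return w.count.incrementAndGet() <= limitPerWindow;
    }

    public static void main(String[] args) {
        ApiKeyThrottle throttle = new ApiKeyThrottle(3, 1000); // 3 requests/second per key
        long now = System.currentTimeMillis();
        for (int i = 1; i <= 5; i++) {
            System.out.println("key-A request " + i + ": " + throttle.allow("key-A", now));
        }
        // A different key has its own independent window
        System.out.println("key-B request 1: " + throttle.allow("key-B", now));
    }
}
```

A real gateway would typically resolve the key from a request header and store windows in a shared cache rather than local memory, but the admission logic is the same shape.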
For scenarios that demand more precise control, throttling logic can be embedded directly in the application code. This method allows real-time adjustments based on changing conditions, such as system load or specific user behavior. Application-level throttling provides granular control, ensuring services remain responsive even under heavy loads.
A theoretical understanding is only part of the solution—successful implementation requires careful planning and coding expertise. Let’s explore four key techniques: TPS rate limiting (transactions per second), concurrent request control, resource-based throttling, and user/IP-based throttling.
The Token Bucket algorithm is a popular technique for rate limiting. It allows a fixed number of tokens (requests) to be processed over a specific period. When the token bucket is empty, further requests are delayed or rejected. This approach ensures that systems handle bursts of traffic smoothly without being overwhelmed.
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RateLimiter {
    private final int maxTokens;           // Maximum tokens allowed in the bucket
    private final int refillRate;          // Tokens added per second
    private AtomicInteger availableTokens; // Current available tokens
    private long lastRefillTimestamp;      // Last time the bucket was refilled

    // Constructor to initialize the rate limiter
    public RateLimiter(int maxTokens, int refillRate) {
        this.maxTokens = maxTokens;
        this.refillRate = refillRate;
        this.availableTokens = new AtomicInteger(maxTokens);
        this.lastRefillTimestamp = System.nanoTime();
    }

    // Try to consume a token. Returns true if successful, false if rejected.
    public synchronized boolean tryAcquire() {
        refillTokens(); // Refill the tokens before processing the request
        if (availableTokens.get() > 0) {
            availableTokens.decrementAndGet(); // Consume one token
            return true;  // Request allowed
        }
        return false;     // Request rejected
    }

    // Refill tokens based on elapsed time
    private void refillTokens() {
        long now = System.nanoTime();
        long elapsedTime = now - lastRefillTimestamp;
        // Whole seconds elapsed determine how many tokens to add
        int tokensToAdd = (int) (TimeUnit.NANOSECONDS.toSeconds(elapsedTime) * refillRate);
        if (tokensToAdd > 0) {
            // Add tokens up to the max limit
            int newTokens = Math.min(maxTokens, availableTokens.get() + tokensToAdd);
            availableTokens.set(newTokens);
            lastRefillTimestamp = now; // Update the last refill timestamp
        }
    }

    public static void main(String[] args) throws InterruptedException {
        RateLimiter rateLimiter = new RateLimiter(5, 1); // Max 5 tokens, 1 token/second
        // Simulate 10 requests, 0.5 seconds apart
        for (int i = 1; i <= 10; i++) {
            if (rateLimiter.tryAcquire()) {
                System.out.println("Request " + i + " processed.");
            } else {
                System.out.println("Request " + i + " rejected. Too many requests.");
            }
            Thread.sleep(500);
        }
    }
}
When multiple requests arrive simultaneously, semaphores help maintain order by controlling the number of concurrent processes. If the semaphore's capacity is exhausted, new requests are either queued or denied. Using semaphores ensures that the system maintains stable performance even when faced with large volumes of simultaneous requests.
import java.util.concurrent.Semaphore;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentRequestLimiter {
    private final Semaphore semaphore; // Semaphore to control concurrent access

    // Constructor to initialize with max concurrent requests
    public ConcurrentRequestLimiter(int maxConcurrentRequests) {
        this.semaphore = new Semaphore(maxConcurrentRequests);
    }

    // Acquire permission to process a request
    public boolean tryAcquire() {
        return semaphore.tryAcquire(); // Returns true if a permit is available, otherwise false
    }

    // Release a permit after processing
    public void release() {
        semaphore.release();
    }

    // Simulate handling a request
    public void handleRequest(int requestId) {
        if (tryAcquire()) {
            System.out.println("Processing Request " + requestId);
            try {
                // Simulate processing time for the request
                TimeUnit.SECONDS.sleep(2);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                release(); // Release the permit after processing
                System.out.println("Finished Request " + requestId);
            }
        } else {
            System.out.println("Rejected Request " + requestId + ". Too many concurrent requests.");
        }
    }

    public static void main(String[] args) {
        int maxConcurrentRequests = 3; // Allow up to 3 concurrent requests
        ConcurrentRequestLimiter limiter = new ConcurrentRequestLimiter(maxConcurrentRequests);
        // Use a thread pool to simulate multiple clients sending requests
        ExecutorService executor = Executors.newFixedThreadPool(10);
        // Simulate 10 incoming requests
        for (int i = 1; i <= 10; i++) {
            final int requestId = i;
            executor.submit(() -> limiter.handleRequest(requestId));
        }
        executor.shutdown();
    }
}
Resource-based throttling comes into play when system resources such as CPU or memory reach critical levels, whether from a legitimate traffic surge or a DDoS attack that would otherwise overwhelm the system. By setting predefined thresholds, such as a maximum CPU utilization of 80%, throttling mechanisms ensure that additional traffic is delayed or rejected before performance deteriorates.
Example in practice:

Scenario: An API must maintain latency under 200ms.
Method:
- Use a load-testing tool to gradually increase traffic.
- Monitor CPU, memory, and response times.
- Observe that:
  - At 400 TPS: CPU usage is 70% and latency is 150ms.
  - At 500 TPS: CPU usage jumps to 85%, latency spikes to 300ms, and 5% of requests fail.
Resulting thresholds:
- CPU threshold: 75-80% (to avoid bottlenecks).
- Max safe TPS: 450 (with the existing infrastructure).
Once these thresholds are established, the TPS rate limiter, concurrent request throttling, or a combination of both can be used to implement resource-based throttling.
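One way to wire the CPU check into admission decisions is a small load-shedding guard built on the JVM's OperatingSystemMXBean. This is a minimal sketch under stated assumptions: the class name ResourceBasedThrottle and the 0.8 per-core threshold are illustrative, and a production system would sample metrics periodically rather than on every request.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class ResourceBasedThrottle {
    private final double maxLoadPerCore; // e.g. 0.8, matching the 75-80% CPU threshold above
    private final OperatingSystemMXBean osBean = ManagementFactory.getOperatingSystemMXBean();

    public ResourceBasedThrottle(double maxLoadPerCore) {
        this.maxLoadPerCore = maxLoadPerCore;
    }

    // Admit the request only while the 1-minute load average per core stays under the threshold
    public boolean allow() {
        double load = osBean.getSystemLoadAverage(); // returns -1 if unavailable on this platform
        if (load < 0) {
            return true; // metric unavailable: fail open rather than rejecting all traffic
        }
        return load / osBean.getAvailableProcessors() < maxLoadPerCore;
    }

    public static void main(String[] args) {
        ResourceBasedThrottle throttle = new ResourceBasedThrottle(0.8);
        System.out.println(throttle.allow() ? "Request admitted" : "Request shed: CPU saturated");
    }
}
```

In practice this guard would sit in front of the TPS or concurrency limiter: requests first pass the resource check, then consume a token or permit.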
Implementing user-based or IP-based throttling in a distributed system requires a coordinated approach to ensure that requests are managed across multiple servers or instances. Here's an example of implementing user-based throttling in a distributed system using Redis for the token bucket approach:
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class DistributedRateLimiter {
    private final JedisPool jedisPool;
    private final int maxTokens;
    private final int refillRate; // Tokens added per second

    public DistributedRateLimiter(JedisPool jedisPool, int maxTokens, int refillRate) {
        this.jedisPool = jedisPool;
        this.maxTokens = maxTokens;
        this.refillRate = refillRate;
    }

    // Note: this read-modify-write sequence is not atomic. If several application
    // instances throttle the same user concurrently, wrap the logic below in a
    // Redis Lua script (EVAL) so it executes as a single atomic operation.
    public boolean isAllowed(String userId) {
        try (Jedis jedis = jedisPool.getResource()) {
            long currentTime = System.currentTimeMillis();
            String tokenKey = "ratelimit:" + userId;
            String lastRefillKey = "lastRefill:" + userId;

            // Current tokens; a first-time user starts with a full bucket
            String currentTokens = jedis.get(tokenKey);
            long tokens = currentTokens != null ? Long.parseLong(currentTokens) : maxTokens;

            // Last refill time
            String lastRefillTime = jedis.get(lastRefillKey);
            long lastRefill = lastRefillTime != null ? Long.parseLong(lastRefillTime) : currentTime;

            // Refill one batch of tokens for each whole second elapsed since the last refill
            long elapsedSeconds = (currentTime - lastRefill) / 1000;
            long newTokens = Math.min(maxTokens, tokens + elapsedSeconds * refillRate);

            // Advance the refill timestamp only by the whole seconds actually consumed,
            // so fractional seconds are carried over instead of being silently discarded
            if (elapsedSeconds > 0) {
                jedis.set(lastRefillKey, String.valueOf(lastRefill + elapsedSeconds * 1000));
            } else if (lastRefillTime == null) {
                jedis.set(lastRefillKey, String.valueOf(currentTime));
            }

            // Check if a token can be consumed
            if (newTokens > 0) {
                jedis.set(tokenKey, String.valueOf(newTokens - 1)); // Consume a token
                return true;  // Request allowed
            }
            jedis.set(tokenKey, String.valueOf(newTokens));
            return false;     // Request throttled
        }
    }
}
Effective request management is not just about limiting access—it’s about maintaining a delicate balance between usability and protection. Whether through rate limiting, concurrency control, user/IP-based, or resource-based throttling, each strategy plays a vital role in system stability. Organizations must continuously monitor, refine, and adapt their throttling policies to align with changing requirements and traffic patterns. With these strategies in place, systems remain resilient, responsive, and ready to scale.
Stay tuned for Part 3: Over-Throttling and Under-Throttling – Achieving Balance