Designing and Implementing Request Throttling

by Sneha Murganoor, November 1st, 2024

Too Long; Didn't Read

Managing request flow effectively in distributed systems is crucial to maintaining stability, reliability, and performance. If unchecked, incoming traffic can overwhelm services, leading to degraded performance or complete failure. Developing a robust throttling mechanism ensures that requests are handled gracefully, balancing availability with system protection.

This is part 2 of a 3-part series. Read part 1 here.

Throttling: Your First Line of Defense

Implementing throttling mechanisms helps prevent overuse of system resources and ensures fair distribution of access. Let’s examine different throttling strategies from the infrastructure to the application level.

1. Network-level throttling through Load Balancers

Load balancers act as the first line of defense by regulating the flow of traffic across multiple backend servers. They identify abnormal traffic patterns—such as a spike in requests from a particular IP—and limit or block access before the backend is impacted. Integrated firewalls further protect the system from Distributed Denial of Service (DDoS) attacks and prevent resource exhaustion during peak periods.

2. API Gateway Throttling

An API gateway is an intermediary between clients and backend services, implementing fine-grained control over incoming requests. API gateways can enforce rate limits based on API keys or user roles, ensuring traffic remains within acceptable thresholds. This method not only manages usage but also provides insights into traffic patterns through monitoring and logging.
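
To make the idea of per-key, per-role limits concrete, here is a minimal sketch of a fixed-window counter. It is an illustration only, not how any particular gateway product works: the class name GatewayRateLimiter, the role names, and the one-second window are all assumptions, and the clock is passed in as a parameter so the behavior is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: fixed-window request counter keyed by API key,
// with the per-window limit chosen by the caller's role.
class GatewayRateLimiter {
    private static final long WINDOW_MILLIS = 1_000;  // assumed 1-second window

    private final Map<String, Integer> limitsByRole;  // e.g. "free" -> 2, "premium" -> 3
    private final Map<String, long[]> windows = new HashMap<>();  // apiKey -> {windowStart, count}

    GatewayRateLimiter(Map<String, Integer> limitsByRole) {
        this.limitsByRole = new HashMap<>(limitsByRole);
    }

    // nowMillis is supplied by the caller; a real gateway would read its own clock.
    synchronized boolean allow(String apiKey, String role, long nowMillis) {
        int limit = limitsByRole.getOrDefault(role, 0);  // unknown roles get no quota
        long[] window = windows.computeIfAbsent(apiKey, k -> new long[] {nowMillis, 0});
        if (nowMillis - window[0] >= WINDOW_MILLIS) {
            window[0] = nowMillis;  // start a new window
            window[1] = 0;
        }
        if (window[1] >= limit) {
            return false;  // over the limit for this window
        }
        window[1]++;
        return true;
    }
}
```

Each API key gets its own counter, so one noisy client cannot consume another client's quota, and the role only decides how large that quota is.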

3. Application-Level Throttling

For scenarios that demand more precise control, throttling logic can be embedded directly in the application code. This method allows real-time adjustments based on changing conditions, such as system load or specific user behavior. Application-level throttling provides granular control, ensuring services remain responsive even under heavy loads.
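
As a rough illustration of adjusting a limit in real time, the sketch below scales a requests-per-second limit down as load rises. The names AdaptiveLimit, baseLimit, and minLimit are hypothetical, and the load signal is assumed to be a value between 0.0 and 1.0 supplied by the application; the linear scaling is one simple choice among many.

```java
import java.util.function.DoubleSupplier;

// Hypothetical sketch: shrink the allowed request rate as system load rises.
class AdaptiveLimit {
    private final int baseLimit;        // limit when the system is idle
    private final int minLimit;         // floor so the service never closes completely
    private final DoubleSupplier load;  // assumed to report load in the range 0.0 to 1.0

    AdaptiveLimit(int baseLimit, int minLimit, DoubleSupplier load) {
        this.baseLimit = baseLimit;
        this.minLimit = minLimit;
        this.load = load;
    }

    // Returns the limit to apply right now, scaled linearly with the remaining headroom.
    int currentLimit() {
        double clamped = Math.max(0.0, Math.min(1.0, load.getAsDouble()));
        int scaled = (int) Math.round(baseLimit * (1.0 - clamped));
        return Math.max(minLimit, scaled);
    }
}
```

The value returned by currentLimit() could then be fed into whichever limiter the application already uses.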

Getting Your Hands Dirty: Throttling Implementation

A theoretical understanding is only part of the solution; successful implementation requires careful planning and coding expertise. Let’s explore four key techniques: TPS (transactions per second) rate limiting, concurrent request control, resource-based throttling, and user- or IP-based throttling.

1. Rate Limiter TPS

The Token Bucket algorithm is a popular technique for rate limiting. A bucket holds up to a fixed number of tokens and is refilled at a steady rate; each request consumes one token. When the bucket is empty, further requests are delayed or rejected. This lets the system absorb short bursts, up to the bucket size, while the sustained rate stays at the refill rate.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RateLimiter {
    private final int maxTokens;         // Maximum tokens allowed in the bucket
    private final int refillRate;        // Tokens added per second
    private final AtomicInteger availableTokens;  // Current available tokens
    private long lastRefillTimestamp;    // Last time the bucket was refilled

    // Constructor to initialize the rate limiter
    public RateLimiter(int maxTokens, int refillRate) {
        this.maxTokens = maxTokens;
        this.refillRate = refillRate;
        this.availableTokens = new AtomicInteger(maxTokens);
        this.lastRefillTimestamp = System.nanoTime();
    }

    // Method to try consuming a token. Returns true if successful, false if rejected.
    public synchronized boolean tryAcquire() {
        refillTokens();  // Refill the tokens before processing the request
        if (availableTokens.get() > 0) {
            availableTokens.decrementAndGet();  // Consume one token
            return true;  // Request allowed
        } else {
            return false; // Request rejected
        }
    }

    // Method to refill tokens based on elapsed time
    private void refillTokens() {
        long now = System.nanoTime();
        long elapsedSeconds = TimeUnit.NANOSECONDS.toSeconds(now - lastRefillTimestamp);

        // Tokens are credited once per whole elapsed second
        if (elapsedSeconds > 0) {
            // Add tokens up to the max limit (long arithmetic avoids int overflow)
            long newTokens = Math.min(maxTokens, availableTokens.get() + elapsedSeconds * refillRate);
            availableTokens.set((int) newTokens);
            // Advance only by the whole seconds credited, so the leftover fraction
            // of a second still counts toward the next refill
            lastRefillTimestamp += TimeUnit.SECONDS.toNanos(elapsedSeconds);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        RateLimiter rateLimiter = new RateLimiter(5, 1);  // Max 5 tokens, 1 token/second

        // Simulate 10 requests
        for (int i = 1; i <= 10; i++) {
            if (rateLimiter.tryAcquire()) {
                System.out.println("Request " + i + " processed.");
            } else {
                System.out.println("Request " + i + " rejected. Too many requests.");
            }
            Thread.sleep(500);  // Simulate 0.5 second between requests
        }
    }
}
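
The limiter above rejects a request when the bucket is empty. Where delaying is preferable, the caller needs to know how long to wait. The helper below is a minimal sketch of that calculation, assuming the whole-second refill used above and a refill rate of at least one token per second. RetryAfter and nanosUntilNextRefill are hypothetical names and are not part of the RateLimiter class.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: how long to wait before the bucket receives its next refill.
class RetryAfter {
    // availableTokens: tokens currently in the bucket
    // nanosSinceLastRefill: time elapsed since the last refill timestamp
    static long nanosUntilNextRefill(int availableTokens, long nanosSinceLastRefill) {
        if (availableTokens > 0) {
            return 0;  // a token is already available, no need to wait
        }
        long oneSecond = TimeUnit.SECONDS.toNanos(1);
        // Refills happen once per whole second, so wait for the remainder of the current second
        return Math.max(0, oneSecond - nanosSinceLastRefill);
    }
}
```

A caller could sleep for the returned duration before retrying, or report it to the client as a retry hint.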

2. Concurrent Request Control

When multiple requests arrive simultaneously, semaphores help maintain order by controlling the number of concurrent processes. If the semaphore's capacity is exhausted, new requests are either queued or denied. Using semaphores ensures that the system maintains stable performance even when faced with large volumes of simultaneous requests.

import java.util.concurrent.Semaphore;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentRequestLimiter {
    private final Semaphore semaphore;  // Semaphore to control concurrent access

    // Constructor to initialize with max concurrent requests
    public ConcurrentRequestLimiter(int maxConcurrentRequests) {
        this.semaphore = new Semaphore(maxConcurrentRequests);
    }

    // Method to acquire permission to process a request
    public boolean tryAcquire() {
        return semaphore.tryAcquire();  // Returns true if permit is available, otherwise false
    }

    // Method to release a permit after processing
    public void release() {
        semaphore.release();
    }

    // Simulate handling a request
    public void handleRequest(int requestId) {
        if (tryAcquire()) {
            System.out.println("Processing Request " + requestId);
            try {
                // Simulate processing time for the request
                TimeUnit.SECONDS.sleep(2);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                release();  // Release the permit after processing
                System.out.println("Finished Request " + requestId);
            }
        } else {
            System.out.println("Rejected Request " + requestId + ". Too many concurrent requests.");
        }
    }

    public static void main(String[] args) {
        int maxConcurrentRequests = 3;  // Allow up to 3 concurrent requests
        ConcurrentRequestLimiter limiter = new ConcurrentRequestLimiter(maxConcurrentRequests);

        // Use a thread pool to simulate multiple clients sending requests
        ExecutorService executor = Executors.newFixedThreadPool(10);

        // Simulate 10 incoming requests
        for (int i = 1; i <= 10; i++) {
            final int requestId = i;
            executor.submit(() -> limiter.handleRequest(requestId));
        }

        executor.shutdown();
    }
}
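
The example above denies a request as soon as no permit is free. For the queued variant, one option is to let a request wait a bounded time for a permit. The sketch below assumes a fair semaphore, so waiting requests are served in arrival order, and a hypothetical maxWaitMillis setting; QueueingRequestLimiter is an illustrative name.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: wait briefly for a permit instead of rejecting immediately.
class QueueingRequestLimiter {
    private final Semaphore semaphore;
    private final long maxWaitMillis;  // assumed upper bound on time spent queued

    QueueingRequestLimiter(int maxConcurrentRequests, long maxWaitMillis) {
        this.semaphore = new Semaphore(maxConcurrentRequests, true);  // fair: first come, first served
        this.maxWaitMillis = maxWaitMillis;
    }

    // Runs the task if a permit frees up in time; returns false if the request gave up waiting.
    boolean handle(Runnable task) {
        try {
            if (!semaphore.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS)) {
                return false;  // queue wait timed out
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt for the caller
            return false;
        }
        try {
            task.run();
            return true;
        } finally {
            semaphore.release();  // always return the permit
        }
    }
}
```

Bounding the wait keeps the queue from growing without limit when the backend is slow.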

3. Resource-based Throttling

Resource-based throttling comes into play when system resources such as CPU or memory reach critical levels. It protects the system from being overwhelmed, whether by a DDoS attack or by legitimate peak traffic. By setting predefined thresholds, such as a maximum CPU utilization of 80%, throttling mechanisms ensure that additional traffic is delayed or rejected before performance deteriorates.

Example in Practice:

Scenario: An API must maintain < 200 ms latency.

Method:
  - Use a load testing tool to gradually increase traffic.
  - Monitor CPU, memory, and response times.
  - Identify that:
      - At 400 TPS: CPU usage is 70%, and latency is 150 ms.
      - At 500 TPS: CPU usage jumps to 85%, latency spikes to 300 ms, and 5% of requests fail.

Thresholds:
  - CPU Threshold: 75-80% (to avoid bottlenecks).
  - Max Safe TPS: 450 TPS (with existing infrastructure).

Once these thresholds are determined, resource-based throttling can be implemented with the TPS rate limiter, concurrent request control, or both.
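
A threshold check of this kind can be sketched as a simple gate in front of request handling. The example below is an assumption-laden illustration: CpuThrottle is a hypothetical name, and the CPU reading is injected as a supplier returning utilization between 0.0 and 1.0 rather than read from any specific monitoring API.

```java
import java.util.function.DoubleSupplier;

// Hypothetical sketch: reject new work while CPU utilization is at or above a threshold.
class CpuThrottle {
    private final double cpuThreshold;      // e.g. 0.80 for an 80% threshold
    private final DoubleSupplier cpuUsage;  // assumed to return utilization in 0.0 to 1.0

    CpuThrottle(double cpuThreshold, DoubleSupplier cpuUsage) {
        this.cpuThreshold = cpuThreshold;
        this.cpuUsage = cpuUsage;
    }

    // True if the request may proceed, false if it should be delayed or rejected.
    boolean allowRequest() {
        return cpuUsage.getAsDouble() < cpuThreshold;
    }
}
```

On HotSpot JVMs, one possible source for the reading is com.sun.management.OperatingSystemMXBean, though a metrics agent or container runtime could serve equally well.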

4. User-based or IP-based Throttling in Distributed Systems

Implementing user-based or IP-based throttling in a distributed system requires a coordinated approach to ensure that requests are managed across multiple servers or instances. Here's an example of implementing user-based throttling in a distributed system using Redis for the token bucket approach:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class DistributedRateLimiter {
    private final JedisPool jedisPool;
    private final int maxTokens;
    private final int refillRate;  // Tokens added per second

    public DistributedRateLimiter(JedisPool jedisPool, int maxTokens, int refillRate) {
        this.jedisPool = jedisPool;
        this.maxTokens = maxTokens;
        this.refillRate = refillRate;
    }

    public boolean isAllowed(String userId) {
        try (Jedis jedis = jedisPool.getResource()) {
            long currentTime = System.currentTimeMillis();
            String tokenKey = "ratelimit:" + userId;
            String lastRefillKey = "lastRefill:" + userId;

            // NOTE: the reads and writes below are separate Redis commands, so two
            // instances can interleave. Production code should run this logic
            // atomically, for example inside a Lua script.

            // Get current tokens (a new user starts with a full bucket)
            String currentTokens = jedis.get(tokenKey);
            long tokens = currentTokens != null ? Long.parseLong(currentTokens) : maxTokens;

            // Get last refill time
            String lastRefillTime = jedis.get(lastRefillKey);
            long lastRefill = lastRefillTime != null ? Long.parseLong(lastRefillTime) : currentTime;

            // Credit tokens for the whole seconds passed since the last refill
            // (clamped at zero in case server clocks disagree)
            long secondsPassed = Math.max(0, (currentTime - lastRefill) / 1000);
            long newTokens = Math.min(maxTokens, tokens + secondsPassed * refillRate);

            // Check if a token can be consumed
            boolean allowed = newTokens > 0;
            if (allowed) {
                newTokens--;  // Consume a token
            }

            // Update tokens, and advance the refill time only by the seconds credited
            // so that frequent callers do not lose partial seconds
            jedis.set(tokenKey, String.valueOf(newTokens));
            jedis.set(lastRefillKey, String.valueOf(lastRefill + secondsPassed * 1000));

            return allowed;  // true = request allowed, false = request throttled
        }
    }
}

Conclusion

Effective request management is not just about limiting access; it’s about maintaining a delicate balance between usability and protection. Whether through rate limiting, concurrent request control, user- or IP-based throttling, or resource-based throttling, each strategy plays a vital role in system stability. Organizations must continuously monitor, refine, and adapt their throttling policies to align with changing requirements and traffic patterns. With these strategies in place, systems are better placed to remain resilient, responsive, and ready to scale.

Stay tuned for Part 3: Over-Throttling and Under-Throttling – Achieving Balance