Picture this: you're a dev at a payment services company. The thing with payment transactions is that you're supposed to process each transaction (e.g. send money from John's account to Jane) exactly once. It's 3 AM, your PagerDuty is blowing up with alerts about transactions being processed multiple times by cron jobs, and you're staring at Redis locks that *should* prevent this exact problem. Sound familiar?

We didn't end up sending money to someone 20 times, but our problem was similar. This is the story of how we went from a complex Redis-based locking nightmare to a beautifully simple database-centric solution, and why you should probably ditch those distributed locks too.

## The Great Redis Lock Disaster of 2025

Let me set the scene. We have this automated testing platform where developers push their OpenAPI specs, and our job workers generate tests in the background. Think Postman, but the test generation is automated. Simple enough, right?

Wrong. The system worked like this:

1. Developer pushes an OpenAPI spec with 50 endpoints to our platform
2. We create 50 "pending" test generation tasks in the database
3. Multiple job instances fight over who gets to process what
4. Chaos ensues, developers get duplicate test suites

Our "brilliant" solution was Redis user-level locking:

```python
def main():
    redis = redis_client()

    # Get users with pending tasks
    users_with_pending_tasks = get_users_with_pending_tasks()

    # Try to acquire lock for a user
    for user in users_with_pending_tasks:
        user_id = user["user_id"]
        lock_acquired = redis.setnx(f"process_lock_{user_id}", user_id)

        if lock_acquired:
            print(f"Acquired lock for user {user_id}")
            process_user_tasks(user_id)
            redis.delete(f"process_lock_{user_id}")  # Release lock
            break
```

This looked solid on paper. One job per user, clean separation, what could go wrong?

Turns out, a lot. Here's what we discovered after days of debugging.

### The Ghost Lock Problem

Jobs would crash (because who writes perfect code?), leaving behind zombie locks in Redis. Developer #12345's tasks would be forever locked, waiting for a job that no longer exists. We'd have to manually clean these up, which is about as fun as debugging CSS alignment issues.

### The Race Condition Ballet

Even worse, we had this beautiful race condition where two jobs would:

1. Both check if a developer has pending tasks ✓
2. Both try to acquire the same user's lock
3. One succeeds, one fails, but...
4. The winner sometimes processed tasks that were already being handled

It was like watching two developers simultaneously fix the same bug in different branches.
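The textbook band-aid for the ghost locks, by the way, is to give the lock a TTL and release it in a `finally` block. Here's a rough sketch of that approach (the `process_with_expiring_lock` wrapper and the `LOCK_TTL_SECONDS` value are illustrative, not our actual code):

```python
LOCK_TTL_SECONDS = 300  # illustrative value

def process_with_expiring_lock(redis, user_id):
    lock_key = f"process_lock_{user_id}"
    # nx=True -> only set the key if it doesn't exist yet;
    # ex=... -> auto-expire, so a crashed job can't hold the lock forever
    if not redis.set(lock_key, user_id, nx=True, ex=LOCK_TTL_SECONDS):
        return False  # someone else holds the lock
    try:
        process_user_tasks(user_id)
    finally:
        redis.delete(lock_key)  # best effort; the TTL covers us if this never runs
    return True
```

That mostly takes care of the zombie locks, but it does nothing about the race above, and the expiry brings its own failure mode: a job that outlives its TTL quietly loses the lock while it's still working.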
### The "It Should Work" Syndrome

The most frustrating part? The logic was sound. User-level locking *should* prevent duplicate processing. But we were still getting duplicate test suites generated, and developers were opening GitHub issues faster than we could close them.

After staring at this code for the hundredth time, my teammate dropped this gem: "Why are we even using Redis for this? Isn't our database already designed to handle concurrency?"

## The Database Epiphany

They were right. Here we were, adding this complex external dependency when PostgreSQL has been solving concurrency problems since before Redis was even a twinkle in antirez's eye.

The solution was embarrassingly simple:

```python
def claim_task(task_id):
    result = execute_query(
        "UPDATE tasks SET status = 'processing' WHERE id = %s AND status = 'pending'",
        (task_id,)
    )
    return result.rowcount > 0  # True if we successfully claimed it
```

That's it. Twelve lines of Redis complexity replaced by three lines of SQL.

### Why This Actually Works

When two jobs try to claim the same task simultaneously:

1. Job A executes the `UPDATE` first (we're talking microseconds here)
2. Job B executes the `UPDATE` immediately after
3. Job A finds `status = 'pending'`, updates it to `'processing'`, returns `rowcount = 1`
4. Job B finds `status = 'processing'` (not `'pending'`), updates nothing, returns `rowcount = 0`

The database engine handles all the locking, isolation, and consistency for us. It's literally what ACID properties were designed for.

## The New Approach: Beautifully Boring

Here's what our main processing loop became:

```python
def main():
    # Get all pending tasks - dead simple
    pending_tasks = get_tasks_by_status("pending")

    for task in pending_tasks:
        # Try to atomically claim this task
        if claim_task(task["id"]):
            print(f"Got task {task['id']}, let's go!")
            process_task(task)
        else:
            print(f"Task {task['id']} stolen by another worker, moving on...")


def claim_task(task_id):
    result = execute_query(
        "UPDATE tasks SET status = 'processing' WHERE id = %s AND status = 'pending'",
        (task_id,)
    )
    return result.rowcount > 0
```

No Redis calls. No lock cleanup. No timeouts. No complex error handling. Just pure, boring database operations.
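For context, here's roughly the shape of the `tasks` table this pattern assumes. This is a simplified sketch rather than our exact production schema, and `execute_query` is the same little helper used in the snippets above:

```python
# Illustrative schema sketch: the claim_task() pattern only really needs
# an id, a user_id, and a status column to flip.
execute_query("""
    CREATE TABLE IF NOT EXISTS tasks (
        id          BIGSERIAL PRIMARY KEY,
        user_id     BIGINT      NOT NULL,
        status      TEXT        NOT NULL DEFAULT 'pending',  -- 'pending', 'processing', ...
        payload     JSONB,                                   -- whatever the worker needs
        created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
    )
""")

# The workers constantly scan for 'pending' rows, and later queries filter by
# (user_id, status), so these two indexes carry most of the load.
execute_query("CREATE INDEX IF NOT EXISTS idx_tasks_status ON tasks (status)")
execute_query("CREATE INDEX IF NOT EXISTS idx_tasks_user_status ON tasks (user_id, status)")
```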
## Enter the Resource Hog Problem

But wait, there's more! (There's always more, isn't there?)

Our celebration was short-lived. Within a week, we discovered a new problem: one startup uploaded their monolithic API spec with 1,000 endpoints (yes, we've all been there). Guess what happened? All our job instances started fighting over that user's tasks, completely ignoring everyone else. Meanwhile, Sarah uploaded her simple microservice spec with 3 endpoints and watched it sit in the queue for hours while MegaCorp's monolith hogged all the workers. Classic tragedy of the commons.

This is where the simplicity of our solution became both a blessing and a curse. It was too fair, treating all tasks equally regardless of user impact.

## The Fairness Fix: Not All Users Are Created Equal

We went with the simplest solution first: limit each user to 2 concurrent test generation tasks, max.

```python
def main():
    pending_tasks = get_tasks_by_status("pending")

    for task in pending_tasks:
        # Check if this user is already hogging resources
        user_task_count = get_user_task_count(task["user_id"], "processing")

        if user_task_count >= 2:  # Max 2 concurrent tasks per user
            print(f"User {task['user_id']} hit their limit, skipping...")
            continue

        # Try to claim the task
        if claim_task(task["id"]):
            print(f"Processing task for user {task['user_id']}")
            process_task(task)


def get_user_task_count(user_id, status):
    result = execute_query(
        "SELECT COUNT(*) as count FROM tasks WHERE user_id = %s AND status = %s",
        (user_id, status)
    )
    return result["count"]
```

Simple, effective, and Sarah is happy again.

## There Are A Few Other Ways To Solve This

### The Single-Query Approach: For the SQL Wizards

If you want to be fancy (and reduce database calls), you can do the fairness check and task claiming in one atomic operation:

```python
def claim_task_with_fairness(task_id, max_concurrent=2):
    result = execute_query("""
        UPDATE tasks t1
        SET status = 'processing'
        WHERE t1.id = %s
          AND t1.status = 'pending'
          AND (
              SELECT COUNT(*)
              FROM tasks t2
              WHERE t2.user_id = t1.user_id
                AND t2.status = 'processing'
          ) < %s
    """, (task_id, max_concurrent))
    return result.rowcount > 0
```

This is beautiful from a database perspective: one query does it all. But it's harder to debug when things go wrong, and trust me, things will go wrong.
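If you do go this route, the main loop gets even shorter, because the separate per-user count check disappears into the claim itself. A sketch of what that loop could look like, using the same helpers as before:

```python
def main():
    pending_tasks = get_tasks_by_status("pending")

    for task in pending_tasks:
        # One atomic statement decides both "is it still pending?" and
        # "is this user under their concurrency limit?"
        if claim_task_with_fairness(task["id"], max_concurrent=2):
            print(f"Processing task for user {task['user_id']}")
            process_task(task)
        else:
            # Either another worker claimed it first, or the user is at their limit,
            # and from this side of the query you can't tell which. Hence "harder to debug".
            print(f"Skipping task {task['id']}")
```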
### The Round-Robin Approach: Maximum Fairness

For maximum fairness, you can prioritize users who have fewer tasks running:

```python
def get_fair_pending_tasks():
    """Get tasks ordered by user fairness - users with fewer running tasks go first"""
    return execute_query("""
        SELECT t1.*
        FROM tasks t1
        LEFT JOIN (
            SELECT user_id, COUNT(*) as running_count
            FROM tasks
            WHERE status = 'processing'
            GROUP BY user_id
        ) t2 ON t1.user_id = t2.user_id
        WHERE t1.status = 'pending'
        ORDER BY COALESCE(t2.running_count, 0) ASC, t1.created_at ASC
    """)
```

This query is doing some heavy lifting:

1. Get all pending tasks
2. Count how many tasks each user has running
3. Order by fewest running tasks first, then by creation time

It's more complex, but it gives you true round-robin fairness. MegaCorp's monolith still gets processed, but not at Sarah's microservice's expense.

### The Time-Based Approach: "When Did I Last Process This User?"

You could also add a `last_processed_at` timestamp to users and prioritize those who haven't been processed recently (there's a rough sketch of this just before the performance section below). But honestly, that's probably overkill unless you're running something like GitHub Actions at scale.

## The Lessons I Wish I'd Learned Sooner

### 1. Your Database Is Smarter Than You Think

I spent weeks building a distributed locking system when PostgreSQL was sitting there like "I've literally been doing this since 1996, but okay..."

Databases are *designed* for concurrency. ACID properties exist for exactly these scenarios. Use them.

### 2. Complexity Is a Bug, Not a Feature

Every line of Redis locking code was a potential failure point:

- Network timeouts
- Lock cleanup failures
- Race conditions between services
- Memory management in Redis

The database solution eliminated all of this. Sometimes the best code is the code you don't write.

### 3. Fairness Isn't Optional in Multi-Tenant Systems

We learned this the hard way when MegaCorp's monolith starved all the microservice users. If you're building anything where multiple users compete for resources, think about fairness from day one, not when your users start filing angry GitHub issues.

### 4. Start Simple, Then Optimize

We went with the two-query approach (check user count, then claim task) rather than the fancy single-query version. Why? Because when something breaks at 3 AM, you want to be able to debug it quickly.
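For the curious: the time-based variant mentioned earlier could look something like this. Purely illustrative, assuming a nullable `last_processed_at` column on a `users` table that gets bumped whenever one of that user's tasks finishes:

```python
def get_least_recently_served_tasks():
    """Illustrative sketch: serve users we haven't processed recently first."""
    return execute_query("""
        SELECT t.*
        FROM tasks t
        JOIN users u ON u.id = t.user_id
        WHERE t.status = 'pending'
        ORDER BY u.last_processed_at ASC NULLS FIRST,  -- never-served users jump the queue
                 t.created_at ASC
    """)
```

Still just one query against tables you already have, but now you're maintaining an extra timestamp for a fairness problem the two simpler approaches already solve.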
## Performance Reality Check

Let's be honest about the trade-offs:

**Database approach:**

- ✅ No Redis memory usage
- ✅ Leverages existing database infrastructure
- ✅ ACID guarantees
- ❌ Extra SELECT query for fairness checks
- ❌ Slightly higher database load

**Redis approach:**

- ✅ Fast in-memory operations
- ✅ Dedicated locking primitives
- ❌ Additional infrastructure to maintain
- ❌ Network calls can fail
- ❌ Lock cleanup complexity
- ❌ Memory management

For our use case, the database approach was clearly better. Your mileage may vary.

## When To Use Each Approach

**Go with database atomicity when:**

- Your tasks live in the database anyway
- You want strong consistency guarantees
- You're trying to reduce infrastructure complexity
- You trust your database more than your distributed systems skills (smart choice)

**Stick with Redis/distributed locks when:**

- You need locks across multiple databases
- Tasks involve complex multi-step operations
- You already have Redis infrastructure you're comfortable with
- You're building something like a workflow engine with complex state

## The Bottom Line

We replaced 50 lines of complex Redis locking logic with 5 lines of SQL and immediately solved our duplicate processing problem. Sometimes the best engineering solution is the boring one.

Your database has been solving concurrency problems longer than most of us have been writing code. Maybe it's time to trust it.