Skills for Technical Leadership: On Distributed Transactions

Written by dylandrop | Published 2018/01/31
Tech Story Tags: services | service-design | distributed-systems | technology | software-architecture


This is a part of an ongoing series. Check out my profile for more.

It’s been a couple months, but I’m back and ready to talk about coordinating state across a distributed system, specifically with regards to distributed transactions. Many businesses require transactional operations, that is, operations consisting of many parts that should succeed or fail as an atomic unit. When we get to the world of SOA, things become more complicated, as we’ll see in this post. We’ll start off by looking at a non-transactional operation, then move to a transactional operation to get a sense of the differences in problems posed by the two scenarios.

As seen in the previous post, resolving the partial failure of an operation that acts across a distributed system can be complicated. Last time, we looked at a distributed blogging service:

def publish_blog_post(post_id)
  PostDB.execute("UPDATE posts SET published = 't' WHERE id = ?", post_id)
  NotificationService.notify_friends(post_id)
  FeedUpdatingService.update_feeds(post_id)
  TwitterService.tweet_about_post(post_id)
end

As discussed in the previous post, calls like NotificationService.notify_friends can fail or time out, leaving us in a state of uncertainty. Previously we discussed idempotency as a strategy for solving such issues: if we retry once or twice the request will likely eventually succeed, and we don’t need to worry about the issue of double-performing any action. We were able to greatly reduce complexity by making each operation — including publish_blog_post itself — idempotent.
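As a rough illustration of what that looks like in practice (a minimal sketch, not code from the earlier post; the with_retries helper is an assumption), each downstream call can simply be retried, because repeating an idempotent call is harmless:

# Hypothetical helper: keep calling the block until it succeeds or we run
# out of attempts. This is only safe because the wrapped calls are idempotent.
def with_retries(max_attempts = 3)
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    retry if attempts < max_attempts
    raise
  end
end

def publish_blog_post(post_id)
  PostDB.execute("UPDATE posts SET published = 't' WHERE id = ?", post_id)
  with_retries { NotificationService.notify_friends(post_id) }
  with_retries { FeedUpdatingService.update_feeds(post_id) }
  with_retries { TwitterService.tweet_about_post(post_id) }
end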

In this example there’s no real interdependence between all those downstream service calls; the success or failure of each service request should not affect the others. To be more precise, the operation publish_blog_post appears to be non-transactional: if one service call fails, we likely do not want to roll back any of the other operations. Sometimes we don’t get quite as lucky, and find ourselves in a situation that requires operations of a transactional nature.

Let’s take an example e-commerce application for buying and selling tickets. A user can sell a ticket to another user. The buyer wants to pay with a voucher we gave him. When this happens, the application should “reassign” the ticket to the buyer, use up the buyer’s voucher, and credit the seller’s account. In this example, our company chose to split its services three ways: an account service, which manages payouts to a user’s bank account; a promo service, which manages promotions and vouchers; and the ticketing service, which coordinates the selling and ownership of a ticket. Here’s some example code for selling the ticket in the ticketing service:

def sell_ticket(buyer_id, seller_id, ticket_id)
  ticket = TicketDB.execute("SELECT * FROM tickets WHERE id = ?", ticket_id)
  TicketDB.execute("UPDATE tickets SET user_id = ? WHERE id = ?", buyer_id, ticket_id)
  PromoService.debit(buyer_id, ticket.amount)
  AccountService.credit(seller_id, ticket.amount)
end

This operation is clearly transactional: we shouldn’t let the buyer get charged without getting his ticket, and we shouldn’t let the seller trade away his ticket without getting credited for the sale. Keeping in mind that any of these services could go down at any point in time, we have to account for all sorts of failure scenarios. What happens if the “credit” call starts breaking? Can we recover and undo the calls to reassign the ticket and debit the buyer? What if the debit call times out? Do we have the ability to retry and move forward? If the service repeatedly times out, can we reassign the ticket back to the seller? As you can see, there are a lot of issues here, and the complexity of the code needed to solve them can grow quickly. Idempotency could be useful as it was in our blogging application, but it might not be sufficient to simply retry ad nauseam as we did in the blogging platform. If a system fails for enough requests, we may wish to “roll back” the other operations.

In a monolithic application using a traditional RDBMS this would be an excellent case for using a database transaction like so:

BEGIN;
UPDATE tickets SET user_id = <buyer_id> WHERE id = <ticket_id>;
UPDATE voucher_balances SET balance = balance - <amount> WHERE user_id = <buyer_id>;
UPDATE accounts SET balance = balance + <amount> WHERE user_id = <seller_id>;
COMMIT;

Concurrency concerns aside, the transactional properties of this example get us the behavior we want: either every statement takes effect or none of them do. The semantics of SQL permit this behavior: the database knows that these statements are to be grouped together as a unit because of the BEGIN…COMMIT bracketing.
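In application code this is usually a one-liner around the same statements. A sketch, assuming an ActiveRecord-style ORM with hypothetical Ticket, VoucherBalance, and Account models:

# Everything inside the block commits together; any exception rolls it all back.
ActiveRecord::Base.transaction do
  Ticket.where(id: ticket_id).update_all(user_id: buyer_id)
  VoucherBalance.where(user_id: buyer_id).update_all(["balance = balance - ?", amount])
  Account.where(user_id: seller_id).update_all(["balance = balance + ?", amount])
end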

It would be nice if we could simply lift this idea and use it in our distributed ticketing platform, but things get a lot more complicated. First off, there are now at least four actors (the ticketing application, its database, the promo service, and the account service), and probably more, since the other services have their own datastores and possibly downstream dependencies of their own. In the monolithic system there were only two (the monolith and its database). Managing consistency across all of those systems grows in complexity, since you have to consider what might happen if any of them were to fail individually (or in tandem). I’ve seen code that attempts to catch errors and “undo” work in these scenarios, rolling back when a failure occurs. Here’s a simplified example of code where we attempt to protect against failure in the AccountService:

TicketDB.execute("UPDATE tickets SET user_id = ? WHERE id = ?", buyer_id, ticket_id)
PromoService.debit(buyer_id, ticket.amount)
begin
  AccountService.credit(seller_id, ticket.amount)
rescue Exception => e
  # Compensate: refund the buyer's voucher and hand the ticket back to the seller.
  PromoService.credit(buyer_id, ticket.amount)
  TicketDB.execute("UPDATE tickets SET user_id = ? WHERE id = ?", seller_id, ticket_id)
end

This, of course, has its flaws. Foremost, the calls to re-credit the buyer’s voucher or assign the ticket back to the seller can themselves fail. The code complexity has increased. And we haven’t even addressed the issue of timeouts or retries yet! Continuing down this path, we’ll likely end up with hard-to-manage code that is riddled with edge cases and difficult to follow.

Some might attempt to ameliorate this issue with distributed commit protocols like two-phase commit, three-phase commit, or others. Note that these protocols do not give you RDBMS-style rollbacks, nor do they guarantee the ACID-ity of your transaction; they merely coordinate a decision to commit a transaction across your services (i.e. “do we all agree to commit this transaction or not?”). This can shave off some of the complexity, but you still might have to write code to undo each operation should the transaction go awry. Moreover, implementing these protocols correctly is usually not trivial.
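To make that concrete, here is the rough shape of a two-phase commit coordinator (a bare-bones sketch; the prepare, commit, and abort methods on each participant are assumptions, and a real implementation also has to survive coordinator crashes and participant timeouts):

def two_phase_commit(participants)
  # Phase 1: ask every participant to prepare and vote.
  all_prepared = participants.all? do |participant|
    begin
      participant.prepare  # durably records intent, returns true for a "yes" vote
    rescue StandardError
      false                # an unreachable participant counts as a "no" vote
    end
  end

  # Phase 2: broadcast the decision. In a real system these messages must be
  # retried until every participant acknowledges them.
  if all_prepared
    participants.each(&:commit)
  else
    participants.each(&:abort)
  end
  all_prepared
end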

Rather than suffer, you should try to minimize the need for distributed transactions altogether. If you’re finding that a lot of atomic operations are split across services, you might want to reassess your service boundaries. Doing so will likely reduce the risk of errors and make your code more comprehensible. Sometimes you really do have to support a distributed transactional operation, and unfortunately there’s no silver bullet here. You’ll have to write code that manages the operation, ensures each dependent system processes its part, and rolls back as gracefully as possible when needed.
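One common shape for that coordinating code is a list of steps, each paired with a compensating action that undoes it (often called a saga). Here is a hedged sketch using the ticketing example from above; note that AccountService.debit is an assumed counterpart to credit, and in practice each compensation needs to be idempotent and retried from durable storage until it succeeds:

# Each step has a forward action and a compensation that undoes it.
STEPS = [
  {
    action:     ->(ctx) { TicketDB.execute("UPDATE tickets SET user_id = ? WHERE id = ?", ctx[:buyer_id], ctx[:ticket_id]) },
    compensate: ->(ctx) { TicketDB.execute("UPDATE tickets SET user_id = ? WHERE id = ?", ctx[:seller_id], ctx[:ticket_id]) }
  },
  {
    action:     ->(ctx) { PromoService.debit(ctx[:buyer_id], ctx[:amount]) },
    compensate: ->(ctx) { PromoService.credit(ctx[:buyer_id], ctx[:amount]) }
  },
  {
    action:     ->(ctx) { AccountService.credit(ctx[:seller_id], ctx[:amount]) },
    compensate: ->(ctx) { AccountService.debit(ctx[:seller_id], ctx[:amount]) }
  }
]

def sell_ticket_saga(ctx)
  completed = []
  STEPS.each do |step|
    step[:action].call(ctx)
    completed << step
  end
rescue StandardError
  # Undo the steps that already ran, newest first.
  completed.reverse_each { |step| step[:compensate].call(ctx) }
  raise
end

Calling sell_ticket_saga(buyer_id: buyer_id, seller_id: seller_id, ticket_id: ticket_id, amount: ticket.amount) then either runs all three steps or attempts to undo whichever ones had already completed.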

That’s all for this time — there will be more to come on distributed systems, so hang tight.

