paint-brush
A Mistake That Could Cost You Millions: What it Is and How to Fix Itby@neenadingole
444 reads
444 reads

A Mistake That Could Cost You Millions: What it Is and How to Fix It

by Neenad IngoleMarch 10th, 2023
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

A small golang mistake that could cost a Million Dollars.
featured image - A Mistake That Could Cost You Millions: What it Is and How to Fix It
Neenad Ingole HackerNoon profile picture


This post is about a real problem I faced in my project. I merged two blog post into one, so this is a bit longer post, please read till the end. Thank you!


I hope after reading this post you will be able to avoid the same mistake we did in our project 😅. This small mistake didn’t cost us a million dollars. But, it did cost us a few thousand, and the Prometheus alerts saved us.


But, it could cost you a million if you are not careful 💸. I learned my lessons through my mistakes; this is something that I will never forget in the future.


The scenario is a simple database transaction. We all have used database transactions at least once in our programming life. If you don’t know how transactions work, you could read the docs here for Postgres.


Our service handles the management of some subscriptions in the database. All the changes to the rows happen within a transaction. The service is in Golang, and the flow is like this:


  • Start the transaction.
  • Fetch the record by ID.
  • Confirm if the operation for the change is possible.
  • If Yes, then update the record.
  • Commit the transaction.
  • For any error, revert the transaction.


To fetch a record and acquire a lock on the record so that other flows would wait to update, we use SELECT ... FOR UPDATE query to fetch the record from Postgres and lock it.


Note: We use sqlx lib for database access.


The repository code is like this:

func (r *fetcher) GetSubscription(tx *sqlx.Tx, id uuid.UUID) (*model.Subscription, error) {
    var subscription model.Subscription
    err := tx.Get(&subscription, `
        SELECT * FROM subscriptions
        WHERE id = $1
        FOR UPDATE
    `, id)
    if err != nil {
        return nil, err
    }
 
    return &subscription, nil
}


The service code is like this:

func (s *service) CancelSubscription(ctx context.Context, id uuid.UUID) (*model.Subscription, error) {
    tx, err := s.db.BeginTxx(ctx, nil)
    if err != nil {
        return nil, err
    }
	
    defer func() {
        if err != nil {
            tx.Rollback()
        }
    }()
    
    subscription, err := s.fetcher.GetSubscription(tx, id)
    if err != nil {
        return nil, err
    }

    if subscription.CancelledAt != nil {
        return subscription, nil
    }
	
    subscription.CancelledAt = time.Now()

    err = s.updater.UpdateSubscription(tx, subscription)
    if err != nil {
        return nil, err
    }

    err = tx.Commit()
    if err != nil {
        return nil, err
    }

    return subscription, nil
}


Problem

All client requests were timing out, and the DB connection spiked over 1.2k connections 😅. Not a single request was able to complete the operation. It was a stop-the-world event 🤣



This is me when this happened and I missed one case


Why?

The issue happens when we try to cancel a subscription that is already canceled. If the subscription is already canceled, we return the subscription without doing any changes, but we are not releasing the lock on the record.


The reason is defer function calls tx.Rollback() only when there is an error. This would cause the lock to be active until the transaction commits or roll back. But since we are not doing any of the two things, the lock is held until the transaction times out.


If the service is having high traffic, the transactions could not timeout. This will cause the code to create new connections to the database and exhaust the connection pool.

Fix

  1. Release the lock in every if condition.
// Error handling is omitted for brevity
if subscription.CancelledAt != nil {
    _ = tx.Rollback() // release the lock

    return subscription, nil
}


This is the simplest way to fix the issue. But this would need you to roll back in every if condition. And if you forget to roll back in any of the if conditions, you will have the same issue.


2. Rollback the transaction in the defer function for every case.

// Error handling is omitted for brevity
defer func() {
   _ = tx.Rollback()
}()


This would release the lock in every case. For committed transactions and as per the tracing, we see some milliseconds the code spends to roll back the transactions. In the case of committed transactions, the rollback does nothing. But this is a better way to fix the issue.


3. Commit the transaction in the defer function for every case.

defer func() {
  if err != nil {
    _ = tx.Rollback()
  }
  
  commitErr := tx.Commit()
  // handle commitErr
}


If there are no changes, the transaction will commit without any change. If there is any error, only then will the rollback happen and will release the lock.


But, this affects the readability of the code. Your commit is happening in the defer function, and that's an extra pointer you have to keep in mind while reading the code.

Service Layer — Return Error 👾

The problem with the service is that for validations, it is not returning any error. The other way to fix this issue is to return an error from the if condition. This would make the Rollback happen in defer function.


This is a very clean way. A service should have only one responsibility. Either it completes the operation, or it returns an error. It should return an error in case of validation failure. This is a good practice and something we missed in our project.


We fixed the issue later by returning an error from the if condition.


This change also helps the handler to decide what HTTP status code to return. Which is really helpful as we can return 400 Bad Request for validation errors.


This is how the code would look after refactoring:

var ErrSubscriptionAlreadyCancelled = errors.New("subscription already cancelled")

func (s *service) CancelSubscription(ctx context.Context, id uuid.UUID) (*model.Subscription, error) {
    tx, err := s.db.BeginTxx(ctx, nil)
    if err != nil {
        return nil, err
    }
	
    defer func() {
        if err != nil {
            tx.Rollback()
        }
    }()
    
    subscription, err := s.fetcher.GetSubscription(tx, id)
    if err != nil {
        return nil, err
    }
    if subscription.CancelledAt != nil {
        return subscription, ErrSubscriptionAlreadyCancelled
    }
 
    subscription.CancelledAt = time.Now()
    err = s.updater.UpdateSubscription(tx, subscription)
    if err != nil {
        return nil, err
    }
    err = tx.Commit()
    if err != nil {
        return nil, err
    }
    return subscription, nil
}


My Take

My personal preference is the second option. But you could choose any of the three options based on your preference. I am choosing the second option because I am sure that whatever happens in the flow at the end, the transaction revert will happen.


Yes, it would cost me a few milliseconds in case of a committed transaction, but compared to the loss to the business, it’s worth it.


Keep in mind that the service layer has only one responsibility. Either it completes the operation or it returns an error.


You may be thinking that a better approach is to have an abstraction that takes care of transactions. I agree, and would also have a separate post on that. But, I also think that we can avoid the complexity of abstraction if the code is stable and not changes too often.


The golang official post for transactions also supports the reasoning. Check the code snippet here. The office post also uses the second option.


// code snippet from the golang official post
func CreateOrder(ctx context.Context, albumID, quantity, custID int) (orderID int64, err error) {
// Create a helper function for preparing failure results.
  fail := func (err error) (int64, error) {
    return fmt.Errorf("CreateOrder: %v", err)
  }
  // Get a Tx for making transaction requests.
  tx, err := db.BeginTx(ctx, nil)
  if err != nil {
    return fail(err)
  }
  // Defer a rollback in case anything fails.
  defer tx.Rollback()
  
  ... // other code is omitted for brevity
  
  // Commit the transaction.
  if err = tx.Commit(); err != nil {
    return fail(err)
  }
  
  // Return the order ID.
  return orderID, nil
}


Wait hold on… There are two more ways to fix the problem. This will also provide one extra layer of safety. In my opinion, having multiple safety gates is always better.


If one fails, there is another one to safeguard with any issues. The business-critical services with an extra layer will protect against any monetary loss.


Excited to know more?


This is how I react when I get multiple ways to solve a problem


Using a Context

We could use a context to cancel the DB transaction if it takes too long. This is a good approach, yet, you would need to fine-tune the timeout value for the context.


If the timeout is too low, you might end up canceling the DB transaction too early. If the timeout is too high, you might end up waiting for the DB transaction to complete for a long time.


Application Traces could help you to figure out the optimal timeout value. Also, load testing an application could help you to figure out the optimal timeout value.


Finding the right values is challenging. This is the reason why this strategy has its place after the previous ones


func (s *service) CancelSubscription(ctx context.Context, id uuid.UUID) (*model.Subscription, error) {
    ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
    defer cancel()
    
    tx, err := s.db.BeginTxx(ctx, nil)
    if err != nil {
        return nil, err
    }
	
    defer func() {
        if err != nil {
            tx.Rollback()
        }
    }()
    
    subscription, err := s.fetcher.GetSubscription(ctx, tx, id)
    if err != nil {
        return nil, err
    }

    if subscription.CancelledAt != nil {
        return subscription, nil
    }
	
    subscription.CancelledAt = time.Now()

    err = s.updater.UpdateSubscription(ctx, tx, subscription)
    if err != nil {
        return nil, err
    }

    err = tx.Commit()
    if err != nil {
        return nil, err
    }

    return subscription, nil
}


In the above code, we are using a context with a timeout of 2 seconds. If the DB transaction is not completed within 2 seconds, we cancel the operation and return an error.


We also pass the context to the repo layer so that we can use the context to cancel any in-progress operation and revert the changes.


The defer cancel() call is important. It ensures that the context gets canceled if the function returns early.

Server-Side Timeout

You set an optimal timeout for the request to complete, and if the request is not completed within that time, you cancel the operation and return an error.


server := &http.Server{
    Addr:         ":8080",
    Handler:      router,
    WriteTimeout: 2 * time.Second,
}


In the above code, we are setting a server-side timeout of 2 seconds. If the entire request processing is not completed within 2 seconds, we cancel the operation and return an error to the client. With this approach, you don’t need to create a timeout context for each request.


But, you would still need to fine-tune the timeout value based on the traces and load testing.

Conclusion

I hope you found this post useful. For customer-facing services, having a tiny bug could cause a big monetary impact. Securing code with enough safety nets is something we should do.


And I hope through this post, I made a lot of people aware of what worse things could happen if the transactions are leaking and how easy it is to fix them. 😊


To conclude my take, I recommend using rollback of the transaction in defer always. Finding a reasonable timeout for context and server-side timeout is difficult. So, it doesn't matter if there is an error or not defer with rollback.


Also, I would recommend keeping a Server side timeout for HTTP services as an extra layer.

Not all applications are HTTP based. So, context timeout could also be a good approach for other kinds of applications.


The context in general has a lot of use cases, and I would recommend you read more about it.


You would also be thinking about the client-side timeout. Those are not reliable enough. For example, a client could keep the timeout for a long duration, and with a high load, the server could keep the connection open for a long time.


This could again lead to the same problem we are trying to solve.

References:


I wanna learn this dance 😅