I had an interesting problem to solve recently. Being fresh out of college, it was exciting to see something I had only learnt in theory show up in a real-world scenario for the first time.
The thing is that occasional orders and some API requests were failing within our e-commerce Rails app. From the exception reports we found out that the failures were caused by MySQL errors:
"Deadlock found when trying to get lock; try restarting transaction"
"Lock wait timeout exceeded; try restarting transaction"
This was bad; something that is an integral part of the business should not fail. So my first priority was to isolate the cause. From the logs I was able to trace back to the line of code that was failing, and I found that both of the above errors came from the same line.
First, a quick recap of what a deadlock is. Suppose there are two transactions, each of which needs to acquire a lock before it can commit.
Transaction A: Locks row X and awaits to lock row Y.
Transaction B: Locks row Y and awaits to lock row X.
Each transaction now holds the lock on the row that the other one needs, creating a situation in which neither can proceed.
In such cases, MySQL detects the deadlock and rolls back one of the transactions so that its locks are released; a lock wait that is not a deadlock simply times out instead.
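To make that concrete, here is a minimal sketch of how the pattern can show up in Rails. The `Account` model, its `balance` column and the ids are hypothetical; the point is just two workers taking row locks in opposite orders:

```ruby
# Worker 1 (e.g. one Unicorn process)
ActiveRecord::Base.transaction do
  a = Account.lock.find(1)   # SELECT ... FOR UPDATE: locks row 1
  b = Account.lock.find(2)   # now needs row 2
  a.update!(balance: a.balance - 10)
  b.update!(balance: b.balance + 10)
end

# Worker 2, running at the same time
ActiveRecord::Base.transaction do
  b = Account.lock.find(2)   # locks row 2
  a = Account.lock.find(1)   # waits for row 1, which Worker 1 holds -> deadlock
  # ...
end
```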
But with Rails you have the ORM and callbacks wrapped in transactions, and I often forget about the SQL underneath, the database protocols, and the ACID properties it has to maintain. It’s all Rails magic.
To the muggle world now!
A few things could be deduced at this point: the failing line of code was running inside a database transaction, the errors meant it was waiting on row locks, and some other transaction was holding those locks. So the next step was to find the other transaction that was locking the rows and causing the timeouts and deadlocks.
In the MySQL console execute the query:
```sql
SHOW ENGINE INNODB STATUS;
```
This gives you the state of the InnoDB storage engine. It has several sections of information, among which you can find details of the latest detected deadlock. You can read more about it in the MySQL documentation.
Here is a sample deadlock section:
```
------------------------
LATEST DETECTED DEADLOCK
------------------------
141016 15:39:58
*** (1) TRANSACTION:
TRANSACTION 3D009, ACTIVE 19 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 376, 1 row lock(s)
MySQL thread id 2, OS thread handle 0x7f0ee43cd700, query id 78 localhost root updating
DELETE FROM t WHERE i = 1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 2428 n bits 72 index `GEN_CLUST_INDEX` of table `mysql`.`t` trx id 3D009 lock_mode X waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 6; hex 000000000700; asc       ;;
 1: len 6; hex 00000003d007; asc       ;;
 2: len 7; hex 87000009560110; asc     V  ;;
 3: len 4; hex 80000001; asc     ;;

*** (2) TRANSACTION:
TRANSACTION 3D008, ACTIVE 69 sec starting index read
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1248, 3 row lock(s)
MySQL thread id 1, OS thread handle 0x7f0ee440e700, query id 79 localhost root updating
DELETE FROM t WHERE i = 1
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 0 page no 2428 n bits 72 index `GEN_CLUST_INDEX` of table `mysql`.`t` trx id 3D008 lock mode S
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 6; hex 000000000700; asc       ;;
 1: len 6; hex 00000003d007; asc       ;;
 2: len 7; hex 87000009560110; asc     V  ;;
 3: len 4; hex 80000001; asc     ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 2428 n bits 72 index `GEN_CLUST_INDEX` of table `mysql`.`t` trx id 3D008 lock_mode X waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 6; hex 000000000700; asc       ;;
 1: len 6; hex 00000003d007; asc       ;;
 2: len 7; hex 87000009560110; asc     V  ;;
 3: len 4; hex 80000001; asc     ;;
*** WE ROLL BACK TRANSACTION (1)
```
I admit that I was quite overwhelmed by this wall of information. But if you squint a little, it all starts to make sense. From this section you can work out which transactions caused the deadlock, and which locks were held and which were waited for.
Things you can infer from this chunk of information:
- The [_RECORD LOCKS_](https://dev.mysql.com/doc/refman/5.5/en/innodb-locking.html#innodb-record-locks) involved (the `RECORD LOCKS space id 0 ...` lines) and, more importantly, the transactional query listed under _WAITING FOR THIS LOCK TO BE GRANTED_ and the one under _HOLDS THE LOCK(S)_.
- The lock modes, `lock mode S` (shared) and `lock_mode X` (exclusive) above. It’s necessary that you understand how locking works so that you know why the deadlock happened.

Voila! Now you know which transactions are causing the trouble.
A web application is a fairly complex system of hardware, software and networking. Multiple users can hit the web app simultaneously, so your app must be written to handle concurrent data access. If this is not done properly, it can result in timeouts and deadlocks.
Also, the app server can run multiple processes, so your host can be executing Ruby code for several requests concurrently. The two main tools you have for handling concurrent data access are transactions and locks.
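In Rails both of these come through ActiveRecord. As a minimal sketch (the `Order` model and its `status` column are assumptions), a transaction combined with a pessimistic row lock looks like this:

```ruby
order = Order.find(order_id)

# with_lock opens a transaction and re-reads the row with
# SELECT ... FOR UPDATE, so concurrent requests touching the same
# order wait for each other instead of interleaving their updates.
order.with_lock do
  order.update!(status: "confirmed")
end
```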
Keep in mind the following points to avoid deadlocks:
2. Be careful about the ActiveRecord callbacks you use. Use `after_commit` for long-running code. For example:
```ruby
class SailBoat < ActiveRecord::Base
  has_many :passengers
  after_create :get_passenger_list

  def get_passenger_list
    # associating all passengers
  end
end
```
Here the `after_create` callback runs inside the same transaction that created the record, so the transaction stays open for a long time while the associations are built. This is bad, as it holds row locks for that whole period and invites deadlocks; `after_commit` runs only after the transaction has committed and the locks have been released (a sketch of the `after_commit` version follows this list).
3. Make sure the common workflows always acquire locks in the same order (there is a lock-ordering sketch after this list).
4. Use queues to prevent concurrent data access.
5. Check what isolation level your application's use case actually requires. By default, InnoDB uses `REPEATABLE READ` (a sketch of changing it per transaction follows this list).
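Here is how the `SailBoat` example above might look with `after_commit`, a minimal sketch assuming the passenger work does not need to be atomic with the create:

```ruby
class SailBoat < ActiveRecord::Base
  has_many :passengers

  # Runs only after the surrounding transaction has committed,
  # so the INSERT's row locks are already released while the
  # passengers are being associated.
  after_commit :get_passenger_list, on: :create

  def get_passenger_list
    # associating all passengers
  end
end
```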
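For point 3, a simple way to keep lock order consistent is to always acquire row locks in a fixed order, for example by primary key. A sketch with the same hypothetical `Account` model as earlier:

```ruby
def transfer(from_id, to_id, amount)
  ActiveRecord::Base.transaction do
    # Always lock the lower id first: every caller then acquires the
    # same rows in the same sequence, so two transfers between the same
    # accounts can no longer wait on each other in a cycle.
    locked = [from_id, to_id].sort.map { |id| Account.lock.find(id) }
    from = locked.find { |a| a.id == from_id }
    to   = locked.find { |a| a.id == to_id }

    from.update!(balance: from.balance - amount)
    to.update!(balance: to.balance + amount)
  end
end
```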
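And for point 5, if your use case can tolerate it, Rails 4.0 and above let you request a different isolation level per transaction. A sketch, again with the hypothetical `Order` model:

```ruby
# READ COMMITTED takes fewer gap locks than InnoDB's default
# REPEATABLE READ, which can ease lock contention for workloads
# that don't need repeatable reads within one transaction.
Order.transaction(isolation: :read_committed) do
  order = Order.lock.find(order_id)
  order.update!(status: "shipped")
end
```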
Back to my problem: the transactions holding the locks came from an external service hitting our application. And thanks to Unicorn, these requests arrive concurrently, which increases the probability of timeouts and deadlocks. Why? Because these were heavy, transaction-laden requests; sometimes they held locks well beyond the default 50-second lock wait timeout, and sometimes they acquired locks in just the wrong order and created a deadlock.
From the `INNODB STATUS` output I was able to figure out the query that _HOLDS THE LOCK(S)_, and it came from a stored procedure that the API request invoked. The order flow and the stored procedure acquired their locks in the reverse order of each other. Making them acquire locks in the same order would have meant a lot of change. Instead, introducing a `TEMPORARY TABLE` within the stored procedure, so that the two indexed tables were joined through an indexed temporary table rather than directly, kept the problem from recurring.
And for the API requests that had been failing with deadlocks or timeouts, I simply implemented a queue with a retry block and pushed them onto it.
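The retry itself can be as simple as catching the MySQL error and re-running the transaction a bounded number of times. A rough sketch (the queue/job wiring is left out, and `MAX_RETRIES` and the helper name are made up):

```ruby
MAX_RETRIES = 3

# Re-runs the block when MySQL asks us to "try restarting transaction".
def with_deadlock_retry
  retries = 0
  begin
    yield
  rescue ActiveRecord::StatementInvalid => e
    raise unless e.message =~ /Deadlock found|Lock wait timeout/
    retries += 1
    raise if retries > MAX_RETRIES
    retry
  end
end

# Usage: wrap the transaction that was occasionally failing.
with_deadlock_retry do
  ActiveRecord::Base.transaction do
    # the work that deadlocked or timed out
  end
end
```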
Whew, I guess I am done here! It was quite a learning curve, and it was pretty cool to understand how the database hustles through each query while staying true to its properties. Feel free to correct me if I have made any mistakes, or share your thoughts in the comments.
Have a good day. Code Long and Prosper!