Building mobile apps is one of the most interesting software engineering jobs ever to exist. Shipping new features while solving challenging problems is a fulfilling process, one that cannot be replaced. On top of that, getting 5-star reviews from your users is rewarding and makes you feel like the hero of the day.
Well, all of the above is and remains true, until the day a small but formidable bug creeps into your app. Your eyes turn red, the party hat gets lost and the once-calm Slack app erupts in chaos. I am not a big fan of horror movies, but I’d rather watch The Conjuring than be woken up by this chaos. Inevitably, no matter how good you are, doomsday always seems to find its way to you, and the race to find and get rid of the bug starts.
Three months into the job, a budding junior Android developer is pulling his weight on a team of three and doing his best to ensure the app he is working on meets high standards. Coupled with the desire to learn and emulate his seniors, he considers himself unstoppable. Up to now, all his changes have been approved and merged with only minor issues. If you have ever been a junior (of course you have 😀), you know such a smooth run is rare, and it has earned him all the senior engineers’ trust.
On a Friday afternoon, a tiny but highly prioritized task pops up on the board, and since the junior dev is working on some minor changes, he is reassigned to the task with a note: “It should be a hotfix to make it to production by the end of the day”.
One of the most important concepts in software engineering is storing data. Even though there are many efficient ways to do that, relational databases still reign supreme. In Android development, a common way of persisting data is SQLite, and Room provides a wonderful abstraction over it.
The simple task entailed renaming a column as well as adding a new one that could hold a null value. Believe it or not, that was the only thing to be done, and within a few minutes everything looked to be in order. The task’s PR was merged as soon as it landed. As it made its way to production, everybody else was getting ready for a fun weekend ahead.
It was a cool Saturday afternoon and I had figured it was a good time to catch up on my favorite series, “The Blacklist”, when my phone became restless and notification pings started raining in. One cannot mistake Slack’s notification sound, which becomes intimidating when you get so many of them in quick succession. It always feels like the call “AVENGERS ASSEMBLE”.
“Users cannot open the app.” This was the message causing chaos all over. So I did a quick installation of the app again, and to my surprise, the app was working fine on my end. Two more colleagues did the same and everything seemed to be okay until another teammate reported the issue too.
Not knowing how deep the problem goes is usually worse than not knowing what the actual problem is. Off the top of my head, I thought it might be related to specific devices, but far too many different devices were reporting the issue for this to hold. “Is it an Android version-related issue?” I asked, trying to brainstorm what was happening while going through the recent code changes. The answer was “Highly unlikely”, as almost all Android versions had reported the issue.
Looking at the app’s traffic and access logs, it was pretty clear some users could still use the app while others couldn’t open it. A quick run of the code still couldn’t pinpoint the issue as the app worked as expected.
Thirty minutes into the battle, I decided to review all the information we had gone through, looking for the pain point. The app already had a few hundred uninstalls over the past few hours. The number of users who had updated to the latest version was higher than the total number of uninstalls. This in itself was not out of the ordinary, but for an app with a few thousand users, the numbers were unusually high considering all this activity had happened within a short time span.
A quick chat with a backend guy brought things into perspective. They were seeing an unusually high number of login requests from the app, which was out of the ordinary as the app only needed the user to log in once.
And then a thought crossed my mind: what if current users were being forced to reinstall the app and thus needed to log in again? Since the app’s previous version was working, this had to be happening to users who had just updated the app. Quickly, I jumped into the release archives, installed the previous version, and updated the app. And VOILA, the elusive crash happened in front of my eyes.
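Looking at the crash logs, every piece fell into place. And you can all guess whose fault it was. I had forgotten to write migrations for the Room database after adding and renaming a column. Fresh installs created the database with the new schema, which is why the app worked for us, but updated installs still had the old schema on disk with no migration path to the new one. On top of that, I had not written comprehensive tests to check for this situation. After all, you can’t test what you don’t know. You can have a further read about the issue here.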
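To make the missing piece concrete, here is a minimal sketch of what such a migration does at the SQLite level, along with the upgrade-path check that would have caught the bug. It is written in plain Python with the standard `sqlite3` module so it can be run anywhere; it is not the app's actual code. The `users` table, the `name`, `full_name` and `nickname` columns, and the helper functions are all hypothetical names chosen for illustration.

```python
import sqlite3


def create_v1(conn):
    """Create the schema as shipped in the previous release (hypothetical)."""
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY NOT NULL, name TEXT NOT NULL)"
    )
    conn.execute("PRAGMA user_version = 1")


def migrate_1_to_2(conn):
    """Rename `name` to `full_name` and add a nullable `nickname` column.

    Uses the create-copy-drop-rename pattern, which also works on SQLite
    builds older than 3.25, where ALTER TABLE ... RENAME COLUMN is missing.
    """
    conn.execute(
        "CREATE TABLE users_new ("
        "id INTEGER PRIMARY KEY NOT NULL, "
        "full_name TEXT NOT NULL, "
        "nickname TEXT)"  # no NOT NULL: existing rows get NULL here
    )
    conn.execute("INSERT INTO users_new (id, full_name) SELECT id, name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")
    conn.execute("PRAGMA user_version = 2")


def upgrade(conn):
    """Bring an existing database up to version 2 if it is still on version 1."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    if version == 1:
        with conn:  # commit on success, roll back if a step raises
            migrate_1_to_2(conn)


if __name__ == "__main__":
    # Simulate a user who installed the old release, saved data, then updated.
    db = sqlite3.connect(":memory:")
    create_v1(db)
    db.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
    upgrade(db)
    rows = db.execute("SELECT id, full_name, nickname FROM users").fetchall()
    print(rows, db.execute("PRAGMA user_version").fetchone()[0])
```

In a Room project, the equivalent would roughly be a `Migration(1, 2)` subclass that runs the same SQL and is registered through `addMigrations(...)` on the database builder, together with a bumped `version` in the `@Database` annotation. The important part is the shape of the check: start from the old schema with real rows in it, then upgrade, instead of testing only a fresh install.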
More than half of the total active users had updated the app and most had to reinstall the app to be able to access the app’s services. A few did completely uninstall the app and never returned. A few negative reviews and 1-star ratings popped up on the Play Store as well. Overall, it was a devastating weekend, and not only was I hurt but the business was too.
Going forward, the test coverage threshold was raised from 60% to a complete 100% to catch such extreme scenarios, and later on, updates started being released to users in batches to ensure issues didn’t affect all of them simultaneously.
Personally, this was my first major programmer error, and it still haunts me to this day. Fortunately, I worked with a great team that had my back the whole time. Lastly, DON’T PUSH CODE TO PRODUCTION ON A FRIDAY is not just a Twitter mantra, but a potential curse on any engineer who decides to break it.