Back in 2016, while I was running my startup, we faced a critical moment that felt like a doomsday scenario: I believed we were hours away from going out of business.
Everything started because, at that time, we were working on a second, more powerful version of our data sync engine. We were developing an iOS mobile app capable of storing data locally while staying fully in sync with our backend.
The sync engine had been designed by our own team in the early days of the startup, and after a year we decided to heavily refactor it to improve its speed and its resilience under specific conditions.
One of the things we did was define the behavior of the sync engine for each status code our backend could return, depending on the type of request the app submitted.
Below is a lookup table with a draft of the schema we created; the picture was taken a few days after we narrowly survived the consequences of not properly implementing the complete handling process it outlines.
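To give an idea of the shape of that schema (the real one lived in the drafted table above, and I won't reproduce it here), here is a purely hypothetical sketch in Ruby. The status codes and actions are illustrative assumptions; the point is the structure, a per-request-type lookup from backend status code to the action the sync engine should take on the device.

# Hypothetical lookup only: the real codes and actions lived in the drafted schema above.
SYNC_BEHAVIOR = {
  create: {
    201 => :mark_record_synced,
    409 => :merge_with_server_copy,
    422 => :discard_local_change_and_notify,
    500 => :retry_later_with_backoff
  },
  update: {
    200 => :mark_record_synced,
    404 => :recreate_record_on_server,
    409 => :merge_with_server_copy,
    500 => :retry_later_with_backoff
  }
}.freeze

# Given the request type and the status code returned by the backend,
# pick the action the sync engine should take on the device.
def sync_action(request_type, status_code)
  SYNC_BEHAVIOR.fetch(request_type, {}).fetch(status_code, :retry_later_with_backoff)
end

puts sync_action(:update, 409) # => merge_with_server_copy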
So, after weeks of programming, the day came and we rolled out the new sync engine in production. I have to say that, at that time, we did not have automated tests for the iOS app. We ran quality assurance tests before the release, but we had no routines in place that verified the logical flow we had designed for the new sync engine. That was, of course, carelessness on our part.
Just one hour after releasing the app, customers began opening tickets complaining about weird behavior. Specifically, one of the app's features was the ability to track deal pipelines, that is, to record the movement of deals through different stages, from the moment a deal is opened until it is closed (either won or lost).
Suddenly, all of our customers' pipelines were compromised: every deal had been moved back to the initial stage, and the sales teams completely lost track of how far each deal had progressed towards the closing point.
For us, it was a tremendous issue, as it was impacting one of the core elements of the app.
We were lost and scared.
But we reacted.
We took a series of actions that, within a couple of hours, solved the problem and brought the situation back to normal. Here is what we did:
Initially, we checked Sentry to see whether our iOS or backend apps were reporting any errors, and we found nothing. That might sound like useless information, but it actually helped us narrow our focus.
It became apparent that we were most likely dealing with a logical issue in the sync engine. That made the situation worse than we had initially anticipated, because finding and resolving that kind of issue was going to be far more challenging.
Luckily, one of the things we had put in place from the very beginning of the startup was a flag in our iOS app that allowed us to remotely lock the app entirely, including any data sync, even in the background. So we turned on the lock, and all of our customers' apps were immediately put into a non-operational state, a sort of maintenance mode.
That temporarily irritated our customers, but it was far better than having their data compromised for a still-unknown reason.
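For illustration only, here is a minimal Ruby sketch of how a backend like ours could expose such a remote lock flag; the module name, file path, and payload fields are assumptions, not our actual implementation. The idea is simply that the iOS app fetches this payload at boot and before every sync, and stops operating whenever the flag is on.

require 'json'
require 'time'

# Hypothetical names throughout: AppLock, LOCK_FILE, and the payload fields are illustrative.
module AppLock
  LOCK_FILE = '/tmp/app_lock.flag' # toggled by an operator during an incident

  def self.locked?
    File.exist?(LOCK_FILE)
  end

  # Payload the mobile client checks at boot and before each sync:
  # when locked is true, the app halts all operations and shows a maintenance screen.
  def self.status_payload
    {
      locked: locked?,
      message: locked? ? 'Maintenance in progress, please try again later.' : nil,
      checked_at: Time.now.utc.iso8601
    }.to_json
  end
end

puts AppLock.status_payload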
Our company went into emergency mode as our team of engineers and developers worked to resolve the issue. We thoroughly analyzed the sync engine to identify any logical flaws, while also verifying that our implementation was aligned with the app's other policies and behaviors.
After two hours of reasoning, investigation, and experimentation, we finally found the issue. Without going into the details, it is enough to say that it was a mix of an iOS-side sync implementation that was not consistent with the aforementioned REST API schema and an incompatibility with another data management function in the iOS app code.
Once we fixed the app, we submitted the new release and impatiently waited for Apple's approval.
When the updated version of the app became available in the App Store, we disabled the app lock and, at the same time, forced the app to be updated. This meant that all of our customers were asked to go to the App Store and download the latest version of the app, which contained the fix for the infamous corruption of deal stages.
We were able to do this thanks to another excellent preventative measure we already had in place in the app: a date parameter checked at boot and during every sync, which told users when an update was required to continue using the app. By setting that parameter to the date of our latest release, we ensured that all active customers had downloaded and were running the expected version of the iOS app.
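A minimal sketch of that check, assuming a single minimum-release-date parameter (the real parameter name, value, and transport are not shown here): the app compares its own build date against the minimum date we publish and refuses to run until the user updates.

# Assumption: the backend publishes the release date of the minimum required build.
MINIMUM_RELEASE_DATE = Time.utc(2016, 6, 15)

# The client runs this check at boot and during every sync.
def update_required?(app_build_date)
  app_build_date < MINIMUM_RELEASE_DATE
end

puts update_required?(Time.utc(2016, 6, 1))  # => true  (the user is sent to the App Store)
puts update_required?(Time.utc(2016, 6, 20)) # => false (the app keeps working normally)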
We had solved the bug and brought the app's operation back to normal, but there was still another critical thing to tackle: restoring the deal stages that had been corrupted by the sync issue.
Again, we were lucky: a couple of weeks earlier we had implemented a tiny custom logger at the backend level to keep track of every change the sync activity applied to our database. The Ruby function was simple but effective; it saved the type of CRUD operation, the timestamp, the user (i.e. the customer), the target entity, and the changes applied (storing both the old and the new value).
# Persist one sync activity: which CRUD operation was performed, by whom, when,
# on which entity, and the changes applied (old and new values).
def self.create_activity(op, user, entity, params={}, changes={}, key='self', extras={})
  data = {
    :op => op,
    :key => key,
    :user_id => user.id,
    :occurred_at => Time.now.utc,
    :entity => entity,
    :params => params,
    :changes => changes
  }
  # Merge any extra attributes the caller wants to record alongside the activity.
  extras.each do |k, v|
    data[k] = v
  end
  # Hand the record off to a background worker instead of writing it synchronously.
  Delayed::Job.enqueue(Jobs::QueuetaskJob.new("activity", data))
end
Hence, based on the history of those activities, it was quite fast to write a script that grouped together all the deals whose stages had changed after the faulty iOS release and iteratively rolled them back to the stage they had before that event.
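Here is a self-contained sketch of the idea, with made-up data and without the actual database writes; the record layout mirrors the logger above, but the release timestamp and sample values are assumptions.

RELEASE_AT = Time.utc(2016, 3, 1, 9, 0, 0) # assumption: moment the faulty build went live

# Sample records shaped like the activities saved by the logger above.
activities = [
  { op: :update, entity: 'Deal', occurred_at: Time.utc(2016, 3, 1, 10, 0, 0),
    params: { 'id' => 42 }, changes: { 'stage' => ['negotiation', 'new'] } },
  { op: :update, entity: 'Deal', occurred_at: Time.utc(2016, 3, 1, 11, 30, 0),
    params: { 'id' => 42 }, changes: { 'stage' => ['new', 'new'] } }
]

# Keep only the deal stage changes that happened after the faulty release.
corrupted = activities.select do |a|
  a[:entity] == 'Deal' && a[:op] == :update &&
    a[:occurred_at] >= RELEASE_AT && a[:changes].key?('stage')
end

# For each deal, the earliest logged change after the release carries the
# pre-release stage as its old value; that is the value to restore.
corrupted.group_by { |a| a[:params]['id'] }.each do |deal_id, events|
  original_stage = events.min_by { |e| e[:occurred_at] }[:changes]['stage'].first
  puts "Deal #{deal_id}: restore stage to '#{original_stage}'"
  # the real script issued the corresponding update against the database here
end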
To conclude, William Arthur Ward used to say: “The pessimist complains about the wind; the optimist expects it to change; the realist adjusts the sails.”
I'm pleased to say that our team demonstrated strength and resilience in responding quickly to a critical event that could have permanently compromised the reliability of our app and our business. We remained level-headed and optimistic, believing that we could overcome the challenge through teamwork and collaboration.
Fortunately, we had access to a range of effective tools that helped us efficiently manage the situation and reach a resolution point without delay. This experience was a crucial turning point for us, demonstrating our ability to handle unexpected challenges with agility and confidence.