Last evening, a GitHub issue reported a change to the Facebook API that caused crashes on many (possibly all) requests made from Facebook’s iOS SDK. This meant that every application that uses the iOS Facebook SDK experienced crashes, leading to major outages worldwide. Because the bug was already present in the Facebook SDK and was triggered by a change in the behavior of the Facebook API, the crashes happened immediately and simultaneously. This was an unexploded landmine.
Most major consumer applications have some kind of Facebook integration, for example “Log in with Facebook” or “Share on Facebook”. Since many applications are deeply integrated with the Facebook SDK, this bug made such applications crash on every boot, effectively causing a total outage for them. TechCrunch reported that Spotify, Pinterest, and TikTok were all down.
Since we’re a widely used product that captures crash reports, we know from our data that several popular mobile apps were affected and saw a huge increase in crash volume as a result.
Bugsnag’s systems dealt with this by buffering the huge volume of additional crash reports without dropping them. Once our systems had scaled up to handle the additional load and Facebook had rolled back the change that caused the bug, we were able to process the entire backlog of events.
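As a rough illustration of the buffering idea, here is a minimal sketch of a queue that absorbs a burst of reports and drains once processing capacity scales up. This is not Bugsnag's actual pipeline: the `simulate` function, the tick-based model, and all of the numbers are hypothetical and chosen only to show the shape of the behavior.

```python
from collections import deque


def simulate(arrivals, workers_per_tick, capacity_per_worker):
    """Return the backlog size after each tick. No report is ever dropped.

    arrivals[i]          -- reports arriving during tick i (hypothetical numbers)
    workers_per_tick[i]  -- workers available during tick i
    capacity_per_worker  -- reports one worker can process per tick
    """
    backlog = deque()
    history = []
    for tick, incoming in enumerate(arrivals):
        # Buffer every incoming report instead of rejecting it.
        backlog.extend((tick, n) for n in range(incoming))
        # Process the oldest reports first, up to the current capacity.
        budget = workers_per_tick[tick] * capacity_per_worker
        for _ in range(min(budget, len(backlog))):
            backlog.popleft()
        history.append(len(backlog))
    return history


# A burst arrives at tick 1; capacity scales from 1 to 4 workers at tick 2.
print(simulate([10, 100, 100, 10, 10], [1, 1, 4, 4, 4], 20))
# → [0, 80, 100, 30, 0]
```

In this toy model the backlog keeps growing while the burst lasts, then shrinks to zero once the burst ends and the extra workers are in place. The reports that waited are delayed, but none are lost.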
Between 2020-05-06 22:36 UTC and 2020-05-07 02:21 UTC we had delays processing events due to a large, simultaneous increase in event reports from multiple projects. The delay was most significant on what we call our “MachO” queue (where we push events from Apple platform projects: iOS, macOS, tvOS, etc.). Even with the significant increase in volume delaying processing, we were still able to process the majority of events within 5 minutes.
| Time | Activity |
|---|---|
| 2020-05-05 22:22 UTC | GitHub issue created reporting this event (this is a good thread - I’d recommend reading) |
| 2020-05-05 22:30 UTC | A further GitHub issue is created with more interesting responses, again this is worth a read |
| 2020-05-06 22:30 UTC | Bugsnag begins to see a huge, simultaneous increase in crash reports from many of our largest customers’ iOS apps |
| 2020-05-06 22:36 UTC | Bugsnag starts to see an elevated queue length, and elevated event processing times |
| 2020-05-06 23:20 UTC | |
| 2020-05-07 00:11 UTC | Spotify’s status Twitter acknowledges that they are seeing issues |
| 2020-05-07 00:44 UTC | TechCrunch posts an article about major app outages, specifically points out that Spotify, GroupMe, Pinterest, and TikTok are down |
| 2020-05-07 01:00 UTC | Crashlytics updates their status page acknowledging that crash processing is delayed for Android and suspended for iOS |
| 2020-05-07 01:27 UTC | Facebook updates their status page marking the incident as resolved |
| 2020-05-07 02:15 UTC | Bugsnag updates our status page to share the queue backlog has been processed and the incident has been resolved |
| 2020-05-07 02:55 UTC | Crashlytics updates their status page to report that the Android crash processing delay has recovered but that iOS crash processing was still disabled |
A lot has happened in the last 24 hours, and several of us are working on fixes, retros, and pre-emptive planning for when something like this happens again. Just a few weeks ago, an almost identical issue happened with the Google Maps iOS SDK, which affected DoorDash, Uber Eats, and many other apps that rely on maps.
The silver lining of such outages is that they draw attention to good software design and process. They rightly showcase where we need to introduce new best practices and where we may need to fine-tune existing ones.
As a community there are some key questions that need to be answered. Patrick Danino from IMDb said it best.