Have you ever wondered what to do when the very tool you use to track and handle crashes, Firebase Crashlytics, runs into a problem itself? It might look like an impasse, but don't worry: we will do some detective work in this post. I came across a deadlock in Firebase Crashlytics' urgent mode. After some deep digging, I found an unexpected yet efficient solution, drawing inspiration from an unlikely place: XCTest's "expectation" implementation.
Let's begin by explaining what "urgent" mode is. We spend countless hours testing and fixing bugs before deployment. But then something unexpected happens, and your app crashes on launch! Not a single request gets sent to inform you about the incident. So how do you find the reason for a crash that you cannot reproduce?
Firebase Crashlytics comes to the rescue! It has a feature that detects a crash during app startup. When that happens, Crashlytics pauses initialization on the Main thread so that the app does not crash again right away; the crash info will hopefully reach the server before the crash happens again. The name of that feature is "urgent mode."
Let's jump back to the issue at hand. I observed my app was taking an unusually long time to launch. To dig into this, I used lldb to pause my app and examined the issue in detail. As I went through the stack, it didn't take long to spot the culprit: Firebase Crashlytics was interrupting the launch process.
The function regenerateInstallIDIfNeededWithBlock had appeared on the Main thread. This was odd because if you use a symbolic breakpoint, you'll notice that regenerateInstallIDIfNeededWithBlock is normally invoked from a background thread, not the Main thread. This unusual shift was a clear red flag that the expected process flow was off.
Now, let's unravel this deadlock situation. A close examination reveals that regenerateInstallID is preceded by prepareAndSubmitReport, which is itself preceded by processExistingActiveReportPath.
Let's dive into the code to understand it better.
- (void)processExistingActiveReportPath:(NSString *)path
                    dataCollectionToken:(FIRCLSDataCollectionToken *)dataCollectionToken
                               asUrgent:(BOOL)urgent {
  FIRCLSInternalReport *report = [FIRCLSInternalReport reportWithPath:path];

  if (![report hasAnyEvents]) {
    // call is scheduled to the background queue
    [self.operationQueue addOperationWithBlock:^{
      [self.fileManager removeItemAtPath:path];
    }];
    return;
  }

  if (urgent && [dataCollectionToken isValid]) {
    // called from the Main thread
    [self.reportUploader prepareAndSubmitReport:report
                            dataCollectionToken:dataCollectionToken
                                       asUrgent:urgent
                                 withProcessing:YES];
    return;
  }
  // ... rest of the method omitted
}
The "urgent" parameter determines whether the code will run in the background or on the Main thread. Submitting a report from the Main thread seems like expected behavior.
regenerateInstallID is waiting for the semaphore to be signaled, which should happen when [self.installations installationIDWithCompletion] completes. The code of regenerateInstallID looks like this (simplified for the sake of brevity):
- (void)regenerateInstallID {
  dispatch_semaphore_t semaphore = dispatch_semaphore_create(0);

  // This runs Completion async, so wait a reasonable amount of time for it to finish.
  [self.installations
      installationIDWithCompletion:^(void) {
        dispatch_semaphore_signal(semaphore);
      }];

  intptr_t result = dispatch_semaphore_wait(
      semaphore, dispatch_time(DISPATCH_TIME_NOW, FIRCLSInstallationsWaitTime));
}
To figure out why the completion did not fire, I dug down the chain of calls behind installationIDWithCompletion and did not notice any path that could skip the completion.
The real issue revealed itself when I noticed the completion wrapped in a FBLPromise.then {} block. This block is dispatched asynchronously on the Main thread, as shown here:
@implementation FBLPromise (ThenAdditions)

- (FBLPromise *)then:(FBLPromiseThenWorkBlock)work {
  // Where defaultDispatchQueue is gFBLPromiseDefaultDispatchQueue by default
  return [self onQueue:FBLPromise.defaultDispatchQueue then:work];
}

@end

static dispatch_queue_t gFBLPromiseDefaultDispatchQueue;

+ (void)initialize {
  if (self == [FBLPromise class]) {
    gFBLPromiseDefaultDispatchQueue = dispatch_get_main_queue();
  }
}
So, the deadlock essentially boils down to this: the Main thread is blocked on a semaphore, waiting for the completion handler to signal it, but the completion handler itself is stuck in the main queue, waiting for the Main thread to become free and run the block submitted with dispatch_async. This circular dependency was causing our app launch to stall.
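The same circular wait can be reproduced in a few lines of Swift. This is only a sketch of the pattern, not Crashlytics code: fetchOnMainQueue and blockingWait are hypothetical names, and the short timeout stands in for FIRCLSInstallationsWaitTime so the example returns instead of hanging.

```swift
import Foundation

// Hypothetical stand-in for an API whose completion is always dispatched
// to the main queue, the way the promise's default queue does it.
func fetchOnMainQueue(completion: @escaping @Sendable () -> Void) {
    DispatchQueue.main.async { completion() }
}

// Blocks the calling thread on a semaphore until the completion signals it
// or the timeout elapses. Returns true if the completion ran in time.
func blockingWait(timeout: TimeInterval) -> Bool {
    let semaphore = DispatchSemaphore(value: 0)
    fetchOnMainQueue { semaphore.signal() }
    return semaphore.wait(timeout: .now() + timeout) == .success
}
```

Called from the Main thread, blockingWait(timeout:) gives up after the timeout and returns false: the block that would signal the semaphore sits in the main queue and cannot run until the Main thread stops waiting.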
So, what options are we left with?
We could pass a queue to the promise if we wait for completion on the Main thread. However, this approach would require proposing a new interface to FBLPromise.
We could alter the default queue for all promises. This, however, is a risky move that would affect every call in the SDK.
With my preference for containing bug fixes in their local context to avoid introducing new bugs, I chose not to tweak FBLPromise. Instead, I looked for a solution that would be minimal and confined to this particular case.
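For comparison, the first option amounts to letting the caller choose the queue on which the completion is delivered. The sketch below assumes a hypothetical fetchValue(deliverOn:completion:) function; it is not an existing FBLPromise or Firebase API.

```swift
import Foundation

// Hypothetical async API that lets the caller pick the completion queue.
func fetchValue(deliverOn queue: DispatchQueue,
                completion: @escaping @Sendable () -> Void) {
    queue.async { completion() }
}

// Blocks the calling thread on a semaphore, but asks for the completion on a
// background queue, so the wait is safe even when called from the Main thread.
func waitWithBackgroundDelivery(timeout: TimeInterval) -> Bool {
    let semaphore = DispatchSemaphore(value: 0)
    fetchValue(deliverOn: DispatchQueue.global()) { semaphore.signal() }
    return semaphore.wait(timeout: .now() + timeout) == .success
}
```

Because the completion no longer depends on the main queue, the semaphore is signaled even while the Main thread is blocked. The cost is the one mentioned above: every API in the call chain has to expose the queue parameter.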
If only we could execute an async callback on the Main thread while simultaneously waiting on it... Sounds familiar? Well, it should! We do have this capability in XCTest via waitForExpectations.
Here's an example:
// This test will pass
func testExample() throws {
    let testExpectation = expectation(description: "")
    DispatchQueue.main.asyncAfter(deadline: .now() + 0.5) {
        testExpectation.fulfill()
    }
    assert(Thread.isMainThread == true)
    waitForExpectations(timeout: .infinity)
}
Intrigued, I delved deeper into the XCTest framework's source code to understand how it does that trick.
Here's the related piece of code:
func primitiveWait(using runLoop: RunLoop, duration timeout: TimeInterval) {
    let timeIntervalToRun = min(0.1, timeout)
    runLoop.run(mode: .default, before: Date(timeIntervalSinceNow: timeIntervalToRun))
}
Surprisingly, I discovered we could handle dispatched callbacks on the current thread using a nested RunLoop spinner. This seemed like a promising way out of our deadlock.
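To see the idea in isolation, here is a minimal Swift sketch of waiting on the Main thread by spinning the run loop. CompletionFlag and spinningWait are hypothetical names, and the 0.1 second slice simply mirrors the XCTest snippet above; treat it as an illustration of the technique rather than production code.

```swift
import Foundation

// Simple mutable flag. It is only read and written on the Main thread here,
// which is why marking it @unchecked Sendable is acceptable for this sketch.
final class CompletionFlag: @unchecked Sendable {
    var done = false
}

// Waits on the Main thread for a block queued with DispatchQueue.main.async.
// Instead of blocking, it runs the main run loop in short slices, which lets
// the queued block execute. Returns true if the block ran before the timeout.
func spinningWait(timeout: TimeInterval) -> Bool {
    precondition(Thread.isMainThread, "This sketch must be called on the Main thread")
    let flag = CompletionFlag()
    DispatchQueue.main.async { flag.done = true }

    let deadline = Date(timeIntervalSinceNow: timeout)
    while !flag.done && Date() < deadline {
        // Service the run loop (and with it the main queue) for at most 0.1 s.
        _ = RunLoop.main.run(mode: .default, before: Date(timeIntervalSinceNow: 0.1))
    }
    return flag.done
}
```

Unlike a semaphore wait, this loop keeps the Main thread servicing the main queue, so the completion gets a chance to run and flip the flag.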
To address this deadlock, I adjusted the code to spin the run loop instead of waiting on the semaphore when running on the Main thread. While the run loop spins, the Main thread keeps servicing the main queue, so the completion block submitted with dispatch_async can run and end the wait instead of being blocked behind it.
- (void)regenerateInstallID {
  dispatch_semaphore_t semaphore = nil;
  // Set from the completion block, so it must be declared __block.
  __block bool completed = false;

  bool isMainThread = NSThread.isMainThread;
  if (!isMainThread) {
    semaphore = dispatch_semaphore_create(0);
  }

  [self.installations
      installationIDWithCompletion:^(void) {
        NSAssert(NSThread.isMainThread, @"We expect to get a completion on the main thread");
        completed = true;
        if (!isMainThread) {
          dispatch_semaphore_signal(semaphore);
        }
      }];

  intptr_t result = 0;
  if (isMainThread) {
    // Spin the main run loop until the completion fires or the deadline passes.
    NSDate *deadline =
        [NSDate dateWithTimeIntervalSinceNow:FIRCLSInstallationsWaitTime / NSEC_PER_SEC];
    while (!completed) {
      NSDate *now = [[NSDate alloc] init];
      if ([now timeIntervalSinceDate:deadline] > 0) {
        break;
      }
      [[NSRunLoop mainRunLoop] runMode:NSDefaultRunLoopMode beforeDate:deadline];
    }
    if (!completed) {
      result = -1;
    }
  } else {  // !isMainThread
    result = dispatch_semaphore_wait(semaphore,
                                     dispatch_time(DISPATCH_TIME_NOW, FIRCLSInstallationsWaitTime));
  }
}
Although the proposed solution worked, the maintainers of the Firebase SDK found an even more elegant and streamlined one: calling regenerateInstallID was not required at all. Sometimes the most straightforward fix is the most effective, sidestepping the need for complex or "big-brained" solutions. It is a good reminder to keep refining our solutions with simplicity and efficiency in mind.
Understanding and preventing deadlocks is key to keeping your app responsive. Tools like run loops, locks, and semaphores can help manage tasks across multiple threads, but they can also make things complex and cause deadlocks if not used correctly. When using these tools, it's important to avoid potential issues like race conditions and deadlocks. Keep your code simple, make sure to always balance semaphore waits with signals, and try not to hold locks during lengthy tasks. Applying these concepts correctly can help your app stay responsive and provide a smooth user experience.