Hey Alexa, Some Things You Just Need to Get Right

One of the interesting challenges you find as apps get larger is that there are certain areas where you really need to get everything right or your app can break in surprising and annoying ways.

For PC applications, one “small” mistake was a hang caused by calling some API on the UI thread that was usually fast but sometimes very slow. There are lots of reasons why an API can vary widely in performance. These hangs caused a couple of problems, as much social or organizational as technical. The developer coding it up, and either directly testing it or passing it off to a tester, would see the fast behavior on their local machine with a fast local network and local server. But out in the wild, customers would run into slow or broken network connections that could trigger the slow, hanging behavior. This was especially annoying when most of the app worked but if you navigated into some dark corner you suddenly got stuck. Often the only effective way out was to kill and restart the application.

Finding these problems was one part of the challenge, but the other part was then motivating the team to go and fix them. The app appears to work most of the time, so it’s hard to prioritize the work. Getting good data on this behavior was important to helping motivate the work, and Office invested a lot over the years to try and get good customer data on these problems. Simulating slow networks internally can be helpful in tracking down these kinds of problems, but there is nothing like knowing that 20,000 users ran into this last month.

One of the reasons I was such an advocate for making these kinds of APIs explicitly asynchronous is that I wanted to drive an understanding of the risks and consequences of this performance behavior as far upstream as possible. Generally you’re forced to deal with the consequences of an asynchronous calling pattern up front, as opposed to going back and trying to redesign for a slow synchronous call later as a bug fix. With an asynchronous calling pattern, the developer can’t ignore the basic characteristics of the API while getting to a first version that works.
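To make the point concrete, here is a minimal sketch in Python of what an explicitly asynchronous call forces on the caller. Everything in it is an assumption for illustration: `fetch_list` is a hypothetical stand-in for a network call, and the timeout value is arbitrary. The thing to notice is that the slow path shows up as a named outcome the caller has to handle, rather than as a hang discovered later in the field.

```python
from __future__ import annotations

import asyncio


async def fetch_list(delay_s: float) -> list[str]:
    """Hypothetical stand-in for a network call; delay_s simulates latency."""
    await asyncio.sleep(delay_s)
    return ["milk", "eggs"]


async def load_list(delay_s: float, timeout_s: float = 0.05) -> tuple[str, list[str]]:
    """Bound the wait so the caller always gets an answer it can render."""
    try:
        items = await asyncio.wait_for(fetch_list(delay_s), timeout=timeout_s)
        return ("fresh", items)
    except asyncio.TimeoutError:
        # The slow path is an explicit, handled outcome instead of a hang.
        return ("timed_out", [])


fast = asyncio.run(load_list(delay_s=0.0))  # simulated fast local network
slow = asyncio.run(load_list(delay_s=1.0))  # simulated slow connection
print(fast)
print(slow)
```

The same code passes on the developer's fast machine and on a bad connection, but the second case now has a code path that someone had to write and can test.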

Actually, that’s only mostly true — you can still write broken apps.

The motivation for this little walk down memory lane is a super annoying behavior in Amazon’s Alexa iPhone app. One of the features of Amazon’s Alexa agent is the ability to manage lists, such as grocery lists. “Hey Alexa, add milk to the grocery list.” When we got Alexa this Christmas, the family started experimenting with managing the shopping list using Alexa.

When in the store, you use the Alexa iPhone app to bring up the list and check off the items as you put them in the cart. All pretty nice and simple. The problem is that the grocery store seems to be a dark hole for my cell signal. There are parts of the Alexa app that seem to deal with being offline just fine, but there are other paths that put up a spinning circle and just sit there, not allowing you to proceed until it has finished some theoretically important but mysterious call to the service. This completely breaks the application for this scenario. It is worse than useless, since you get sucked into thinking it might work and then halfway through your shopping trip you’re just stuck and cannot even see the list you were looking at a second ago.

Certainly for mobile apps that have clear offline utility for some parts of their functionality, the development team needs to be motivated to recognize that no part of the application is important enough to be allowed to sit there and fail to deal gracefully with inevitable communication failures. This needs to be as much a team cultural issue as it is a technical issue. Shame and embarrassment for releasing an app with this type of behavior in 2018 seem like an appropriate reaction. Positive emotions like pride in workmanship are important, but my Catholic upbringing also taught me that shame can be a powerful driver.

There are aspects of this problem that feel like the early PC days, when apps needed to be super careful about how they allocated memory and responded to out-of-memory conditions. That issue was common enough that having both architectural and testing strategies for dealing with memory allocation failures was important. Often this involved centralizing where memory was allocated from the system in order to manage it explicitly and carefully. Testing strategies involved injecting either random failures or deterministic failures to do exhaustive iterative testing of certain critical scenarios.
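The deterministic variant of that testing strategy can be sketched roughly as follows. This is an illustration in Python, not a reconstruction of any real tooling: `FailingAllocator`, `build_document`, and `exhaustive_failure_test` are hypothetical names. The idea is to route allocations through one central point, count how many a scenario makes in a clean run, and then rerun the scenario once per allocation with exactly that one forced to fail.

```python
class InjectedFailure(Exception):
    """Raised by the hypothetical allocator when a planned failure fires."""


class FailingAllocator:
    """Central allocation point that fails on the Nth request (1-based)."""

    def __init__(self, fail_on=None):
        self.fail_on = fail_on  # None means never fail
        self.calls = 0

    def allocate(self, size):
        self.calls += 1
        if self.calls == self.fail_on:
            raise InjectedFailure(f"allocation #{self.calls} failed")
        return bytearray(size)


def build_document(alloc):
    """Scenario under test: three allocations, all-or-nothing on failure."""
    buffers = []
    try:
        for size in (16, 32, 64):
            buffers.append(alloc.allocate(size))
        return buffers
    except InjectedFailure:
        buffers.clear()  # release partial work before reporting failure
        return None


def exhaustive_failure_test(scenario):
    """Fail each allocation in turn and collect what the scenario returns."""
    probe = FailingAllocator()
    assert scenario(probe) is not None  # clean run must succeed
    total = probe.calls
    outcomes = [scenario(FailingAllocator(fail_on=n)) for n in range(1, total + 1)]
    return total, outcomes


total, outcomes = exhaustive_failure_test(build_document)
print(total, outcomes)
```

A scenario passes if every injected failure leads to a clean, defined result (here `None`) rather than a crash or a half-built state. The random variant swaps the fixed `fail_on` for a probability.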

In today’s environment, network communication has a similar characteristic. Virtually every network request can — and will — fail. Dealing with failure gracefully is a core design requirement. A spinning circle that blocks all app functionality certainly lacks grace.
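For the shopping list scenario, one way to treat failure as a design requirement is to keep the last good copy of the data and fall back to it when a request fails. The sketch below is a hypothetical illustration in Python and says nothing about how the Alexa app is actually built; `ShoppingListStore`, `online`, and `offline` are made-up names, and network failure is modeled as an `OSError` (the base class of Python's built-in `ConnectionError` and `TimeoutError`).

```python
class ShoppingListStore:
    """Keeps the last successfully fetched list so it can be shown offline."""

    def __init__(self):
        self._cached = []

    def load(self, fetch):
        """Return (items, is_stale); never raises on a network failure."""
        try:
            self._cached = list(fetch())
            return list(self._cached), False
        except OSError:
            # No signal: show the last good copy and flag it as stale
            # instead of blocking the screen on a spinner.
            return list(self._cached), True


def online():
    """Hypothetical fetch that succeeds."""
    return ["milk", "eggs"]


def offline():
    """Hypothetical fetch from inside the grocery store."""
    raise ConnectionError("no signal")


store = ShoppingListStore()
first = store.load(online)    # at home, with a good connection
second = store.load(offline)  # in the store, with none
print(first)
print(second)
```

The stale flag lets the UI say that the list may be out of date while still letting you see it and keep shopping.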