Keeping Timeouts Short

by Terry Crowley, April 15th, 2018

Users hate to wait. They especially hate to wait for wildly varying and unpredictable amounts of time with little feedback or progress indication. This has been well-understood in the design of purely local desktop applications for decades and is just as true for distributed applications, whether desktop, web or mobile. The user expectation is no different between a local or distributed application but the design challenges are deeper and more fundamental when the application is distributed.

Keeping timeouts short is a core strategy for addressing this challenge. I discussed the key role of timeouts in my post “What is too long?” An alternative and perhaps more actionable way of thinking about the same strategy is to design your system so as to reduce the expected variance in the processing of each request. Part of this strategy is also to make it relatively inexpensive and safe at both the client and the service to give up — timeout — and retry a request or quickly propagate the error to higher levels in the system.
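As a concrete illustration of the client side of this strategy, here is a minimal TypeScript sketch that puts a short, fixed timeout on each request and retries a bounded number of times before propagating the failure. The function name, timeout and retry parameters are illustrative choices, not values from any particular system.

```typescript
// Minimal sketch: a short, fixed timeout per attempt plus a bounded retry,
// so a slow or hung request fails fast and predictably rather than leaving
// the user waiting indefinitely. The limits here are illustrative.
async function fetchWithTimeout(
  url: string,
  timeoutMs: number,
  retries: number
): Promise<Response> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      return await fetch(url, { signal: controller.signal });
    } catch (err) {
      // Give up on the last attempt: propagate the error to higher levels.
      if (attempt === retries) throw err;
      // Otherwise fall through and retry -- cheap and safe only if the
      // request is idempotent on the service side.
    } finally {
      clearTimeout(timer);
    }
  }
  throw new Error("unreachable");
}
```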

The challenge in designing an application is often wrapped up in this core task of designing a data model and the operations allowed on it so that the execution time of each operation is both short and predictable. Inevitably, trade-offs need to be made and there arise some set of operations that cannot be performed efficiently. What do you do in this case?

I want to use some examples from FrontPage where we did not do it correctly. As is true for much of life, lots of learning comes from your mistakes.

FrontPage was designed from the start as a distributed application. The FrontPage Server Extensions were integrated with a variety of web servers. The extensions provided both the website editing services used by the FrontPage desktop client as well as dynamic browse-time functionality (discussion lists, guestbooks, page hit counters, etc.) when the site was accessed by a browser.

Most editing operations involved either saving a single page to the server or setting some property in the website metadata. Both these operations could be performed as constrained operations in the context of the overall data model and with predictable latency. The first operation that changed this operating characteristic was also one that was key in distinguishing FrontPage as a web _site_ editor rather than a web _page_ editor. A user had the ability to rename a page. This apparently simple operation in the client user interface launched a flurry of activity on the server to automatically update any page that contained a link to the renamed page so it correctly pointed to the new page location.

The challenge this introduced was that this renaming operation was neither localized nor predictable in running time. In principle, every page in the site might need to be updated. No matter how fast this became (and we worked hard to optimize this processing) it inherently had high potential variance.
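A hypothetical sketch of why the rename is unbounded (the page representation and helper names are invented for illustration, not FrontPage's actual implementation): every page in the site has to be examined, and any page linking to the old name has to be rewritten.

```typescript
// Hypothetical sketch of why a "simple" rename is unbounded work on the
// server: every page in the site is scanned, and any page that links to the
// old name is rewritten. Cost grows with site size, so latency variance is high.
interface Page {
  name: string;
  html: string;
}

function renamePage(site: Map<string, Page>, oldName: string, newName: string): void {
  const page = site.get(oldName);
  if (!page) throw new Error(`no such page: ${oldName}`);
  site.delete(oldName);
  site.set(newName, { ...page, name: newName });

  // The unbounded part: touch every page to fix up links to the old name.
  for (const other of site.values()) {
    if (other.html.includes(oldName)) {
      other.html = other.html.split(oldName).join(newName);
    }
  }
}
```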

In retrospect, this feature would have served as the ideal driving motivation for developing a holistic strategy for dealing with the general problem of unbounded editing operations. As any application grows in functionality, some features can be implemented gracefully within the current architecture while others stretch and push the architecture in unexpected directions. Sometimes these are unique one-off features that don’t warrant large architectural investments while in other cases they serve as motivating features for extending the architecture to gracefully handle a whole new class of functionality.

In this case, we would go on in later versions to implement additional site-wide features around theming, structural templates, automatic navigation bars and site maps that all had this same characteristic of unbounded processing requirements and therefore wide potential variance in processing time.

Underlying all this was a significant constraint on architectural strategy. FrontPage was designed to be able to integrate with other website functionality by persisting individual pages of the site to disk so they could be edited or processed with other tools and served directly by the underlying web server. This meant that certain common approaches for optimizing processing time that might be applied within an application that had full control over the application data model were unavailable. A simple example is my renaming operation — one trivial approach for implementing this efficiently is providing a layer of indirection where links to pages go through an indirection table that allows for efficient renaming. This would have resulted in non-standard links and page naming that was an unacceptable trade-off given our higher level design goals.
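For contrast, here is a sketch of the indirection-table approach we could not use, with invented names: a rename becomes a single table update regardless of site size, but published links are no longer plain page paths.

```typescript
// Sketch of the rejected alternative: links resolve through an indirection
// table keyed by a stable page id, so a rename touches one table entry instead
// of every page. The cost is that links on disk are no longer standard paths.
class LinkTable {
  private idToPath = new Map<string, string>();

  register(id: string, path: string): void {
    this.idToPath.set(id, path);
  }

  // A rename is a single table update -- O(1), regardless of site size.
  rename(id: string, newPath: string): void {
    if (!this.idToPath.has(id)) throw new Error(`unknown page id: ${id}`);
    this.idToPath.set(id, newPath);
  }

  // Pages would embed ids rather than paths; resolution happens at serve time.
  resolve(id: string): string {
    const path = this.idToPath.get(id);
    if (!path) throw new Error(`unknown page id: ${id}`);
    return path;
  }
}
```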

The consequences of unbounded processing were significant. Certain operations that were relatively easy to invoke and seemingly trivial in the user interface (like rename) would involve long-running interactions with the web server. In some cases these would run for so long that the web service or some intermediate gateway or proxy would break the connection, leaving the user (and the application) confused and uncertain about the actual state of the remote web site. HTTP, the protocol underlying the interactions with the remote service, is generally designed for short running interactions; this type of long-running usage was unusual. Additionally, the web service would lock out other processing while the change was executing. This meant that further editing operations, even localized ones like saving a single page, needed to wait for the completion of this long-running operation. The consequence was that virtually every operation had high variance. This was especially painful as sites grew larger and as multiple people edited a single site concurrently. This long-running operation would lock out all editors, not just the single person who had invoked the operation.

The general strategy for addressing this kind of problem has a simple part and a complex part. The simple part is to change the processing of the operation so that it simply queues the request at the service and then immediately returns with a “got it” response. The service is now responsible for completing the long-running operation at some later time.
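A minimal sketch of this "queue it and acknowledge immediately" pattern, with invented operation shapes: the request handler does a constant amount of work and a background worker drains the queue later, off the request path.

```typescript
// Minimal sketch of "queue it and say 'got it'": the request handler only
// enqueues a description of the work and returns immediately; a background
// worker drains the queue. Names and shapes here are illustrative.
type SiteOperation =
  | { kind: "rename"; oldName: string; newName: string }
  | { kind: "applyTheme"; theme: string };

const pending: SiteOperation[] = [];

// Called from the request handler: constant time, predictable latency.
function enqueue(op: SiteOperation): { status: "accepted"; position: number } {
  pending.push(op);
  return { status: "accepted", position: pending.length };
}

// Runs later, off the request path, and may take as long as it needs.
async function drainQueue(apply: (op: SiteOperation) => Promise<void>): Promise<void> {
  while (pending.length > 0) {
    const op = pending.shift()!;
    await apply(op);
  }
}
```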

The complex part is that modeling the state of the website has gotten considerably more complex. Instead of simply being some metadata and a set of pages on disk, the website now needs to be modeled with a queue of pending operations and the attendant processing. Ideally the service would like to allow other operations, especially short localized ones, to proceed to completion without waiting for these long-running tasks to complete. This means the system designer now needs to consider the semantics of interleaving operations. A benefit of this approach of operation queuing is to allow relatively simple batching. For example in this case, if the user invokes a series of operations that all involve rewriting the same sets of pages (e.g. applying individual theming properties like font or color as separate operations or switching back and forth between different themes) processing can relatively easily be optimized to combine the operations in a way where each page is only rewritten once.
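Here is a sketch of the batching opportunity the queue creates, again with invented operation shapes: back-to-back theming operations are coalesced so only the last one is actually applied and each page is rewritten once.

```typescript
// Sketch of the batching win the queue makes possible: successive theming
// operations collapse so only the last one survives, and each affected page
// ends up being rewritten once. Operation shapes are illustrative.
type QueuedOp =
  | { kind: "applyTheme"; theme: string }
  | { kind: "rename"; oldName: string; newName: string };

function coalesce(ops: QueuedOp[]): QueuedOp[] {
  const result: QueuedOp[] = [];
  for (const op of ops) {
    const last = result[result.length - 1];
    if (op.kind === "applyTheme" && last?.kind === "applyTheme") {
      // Two theme changes back to back: the later one wins.
      result[result.length - 1] = op;
    } else {
      result.push(op);
    }
  }
  return result;
}
```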

Another advantage of immediately returning is that it forces the user experience designer to explicitly address how the application should behave during this period. The default and all too frequent answer when some operations take a long time is to simply make the user wait.

When addressing the modeling complexity challenge, there are really two basic strategies although a single system will often combine elements of both. One approach is for the service to attempt to hide the underlying complexity behind its service interface. So for example, a “get” of a page through the editing API would first ensure that any pending processing for that page was complete before returning it to the client. From the client’s perspective, it appears as if all the prior processing has completed synchronously.
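A sketch of this flush-on-read approach, with invented types: the service applies any pending operations that affect the requested page before returning it, so the caller sees a result as if everything had completed synchronously.

```typescript
// Sketch of hiding the pending work behind the service interface: a "get"
// first applies any queued operations that affect the requested page, so the
// caller sees a fully up-to-date page. Names are illustrative; a real service
// would also track which operations have already been applied to which pages.
interface PendingOp {
  affects(path: string): boolean;
  apply(html: string): string;
}

class LazySiteService {
  constructor(
    private pages: Map<string, string>, // path -> html
    private pendingOps: PendingOp[]
  ) {}

  getPage(path: string): string {
    let html = this.pages.get(path);
    if (html === undefined) throw new Error(`no such page: ${path}`);
    // Flush only the work relevant to this page before answering.
    for (const op of this.pendingOps) {
      if (op.affects(path)) html = op.apply(html);
    }
    this.pages.set(path, html);
    return html;
  }
}
```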

This is essentially the “lazy evaluation” design pattern. This is a beautiful strategy when it works. The challenge with this is captured in the common bit of folk wisdom (well, actually a line from a Sir Walter Scott epic poem) “Oh! What a tangled web we weave, when first we practice to deceive”. Truly hiding the actual underlying state gets hard and in some cases impossible. So for example, we beautifully hide this complexity behind the editing APIs but the user directly browses to the website and is confused when some of their changes are missing. As new features are added, new lies need to be told and it all gets unwieldy, complicated, expensive and leaky.

The alternate strategy is to directly expose the true underlying state to the client application and then in some form to the user. The user gets some exposure to the true state of affairs that all downstream processing has not completed (they might want to know this before explicitly publishing the site or announcing their changes). A hybrid approach might be implemented as a shared responsibility between the service and the client — so if the user opens a page, the client will locally apply any pending global theming changes before presenting the page to the user. The user sees a consistent experience but it is implemented as a pragmatic joint effort between the service and the client application. Relevant aspects of the underlying complexity are allowed to leak through and intentionally exposed to the user where useful.
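A sketch of the hybrid approach, with an invented response shape and theming helper: the client applies pending site-wide changes locally before presenting the page, and also surfaces the fact that downstream processing has not finished.

```typescript
// Sketch of the hybrid approach: the client fetches the raw page plus the list
// of pending site-wide changes, applies the relevant ones locally before
// showing the page, and tells the user that processing is still in flight.
// The response shape and applyTheme helper are assumptions for illustration.
interface PendingChange {
  kind: "applyTheme";
  theme: string;
}

interface PageResponse {
  html: string;
  pendingChanges: PendingChange[];
}

function applyTheme(html: string, theme: string): string {
  // Stand-in for real theming: just tag the document with the theme name.
  return html.replace("<body>", `<body data-theme="${theme}">`);
}

function presentPage(response: PageResponse): { html: string; stillProcessing: boolean } {
  let html = response.html;
  for (const change of response.pendingChanges) {
    if (change.kind === "applyTheme") html = applyTheme(html, change.theme);
  }
  // Expose the true state: the user can see that publishing isn't finished yet.
  return { html, stillProcessing: response.pendingChanges.length > 0 };
}
```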

These issues really play out both locally and remotely, but additional latency, error categories and states, multi-device and multi-user access requirements all complicate the distributed case. For example, the asynchronous and lazy execution of repagination in Word or recalculation in Excel addresses a similar concern in a purely local context.

My infatuation with the end-to-end argument means that I am very leery of techniques that claim to perfectly hide deep underlying complexity behind some intermediate layer. You go down a path of explicitly creating a beautifully opaque interface and paint yourself into a corner where you have not built up the infrastructure to allow the “right” information to leak through. I talked about these challenges in my “Beware Transparent Layers” post. This is also highly related to arguments about whether “worse is better”. The worse-is-better argument places high value on implementation simplicity. The underlying motivation is a belief (borne out by painful experience) that underlying implementation complexity will inevitably leak through. One of the most pernicious forms of leakage is exactly the performance anomalies I am focused on in this post.

The reality that there are “strategies” and “arguments” rather than clear and unambiguous guidance for addressing these kinds of issues reflects the complex decision making that often underlies the task of designing systems. These types of challenges are exactly what makes the programmer’s job so fun.