What is “too long”?

I was just re-reading Leslie Lamport’s classic paper, Time, Clocks and the Ordering of Events in a Distributed System. Most of the paper deals with a discussion of how to manage events ordered by logical clocks rather than physical clocks. Of course since the paper was written there has also been much interesting academic and practical work done on how to leverage well-synchronized physical clocks as well (where obviously all the magic is in defining what “well-synchronized” means and how it is accomplished).

In any case, I was most struck by a passage that he wrote near the end of the paper:

The problem of failure is a difficult one, and it is beyond the scope of this paper to discuss it in any detail. We will just observe that the entire concept of failure is only meaningful in the context of physical time. Without physical time, there is no way to distinguish a failed process from one which is just pausing between events. A user can tell that a system has “crashed” only because he has been waiting too long for a response.

There is a lot of dense wisdom in that paragraph!

When moving from writing serial, synchronous algorithms to writing asynchronous, distributed systems, it’s critical to understand where you can abstract away complexity under a simpler interface and where you need to accept that you are dealing with something fundamentally different. When you call a local, synchronous interface by convention you either get a success or failure response within some bounded time. When you send a message to a distributed component it is inherently asynchronous. Even if there is an abstraction of an underlying connection and a protocol that claims to guarantee a success or failure response, at the physical level you are just sending and receiving discrete messages. Ultimately, you need to be prepared to recognize failure simply because something took “too long”. Importantly, “failure” might have been a failure to receive your message, failure to process your message or simply a failure to provide a response (which leads to the concept of idempotency, but that’s a separate topic).

But what is “too long”?

At this point, it appears you only have two choices — you either use some timeout (either hard-coded, configured or dynamically computed based on some historical data stream) or you propagate (punt) the decision about what is “too long” to higher levels of the system, often eventually the user. The user then decides when to give up or explicitly retry based on their own level of urgency, impatience or external understanding of possible system states available through some out-of-band signal (perhaps they understand that the wireless signal from this corner of the property is always sketchy).

There are a couple common mechanisms we use to work around this fundamental challenge.

The most general strategy is to try to reduce the variability in expected response time between different agents in the system. This doesn’t change the general shape of the problem but in practice makes it possible to use tighter time bounds for timeouts and provides opportunity to have feedback and ongoing progress flow through the system in order to provide input for that challenging “too long” assessment.

A common technique when there is fundamental variability in the response time (e.g. you might ask the distributed agent to do a range of very easy things or very difficult time-consuming things) is to have the agent respond with a simple “got it” response once the request has been accepted and queued. At that point, the distributed agent is in an explicit, discoverable state of processing that request and will later move to an explicit, discoverable state that reflects the completion of the request. Modeling the interaction in this way has the advantage that it reflects the reality of what is happening in the overall system. The guidelines Microsoft developed for REST APIs that fit into the Microsoft OneAPI design pattern advise taking this general approach for requests that have high variance in completion time.

A less effective variant on this technique is to provide ongoing feedback on progress of an operation (either feedback by moving through explicit distinct states, providing some ongoing activity indication or an explicit progress percentage). Underlying this approach, you have the same basic system structure — an agent moving through explicit system states.

My experience is that you’re better off modeling the system explicitly using the first technique. You focus on modeling the system itself rather than modeling the progress of the operations being performed on the system. It typically requires more initial infrastructure but results in a looser coupling between the components of the system. Especially in our multi-device world with agents coming and going and agents with intermittent and varying connectivity, modeling the system as different discoverable states rather than with explicit 1–1 channels of communication and feedback makes the system more robust and flexible.

This perspective that failure only is meaningful in the context of physical time is one of those key places where the real “physics” of the systems we build comes in to play. We live in an Einsteinian universe and the systems we build need to reflect that.