Still Trying to Get Control-C Right?

Written by terrycrowley | Published 2018/04/30


If you’re a loyal reader (I’m sure there are a couple of you out there) you know that the topic of timeouts, cancellation and asynchrony is a source of endless fascination for me. I recently came across a series of posts by Nathaniel Smith, author of the Python async library Trio, that I found very interesting.

Both “Timeouts and Cancellation for Humans” and “Go Statement Considered Harmful” motivate the design of Trio with a detailed analysis of the concurrency, asynchrony, cancellation and timeout mechanisms in a whole range of languages and systems. Comparing those systems to the mechanism he’s building in Trio helps tease out their differences and issues. This makes it an interesting analysis even if you never get anywhere near Python or Trio.

I’m not going to do a full review, but I did want to pull out a few points that I found particularly interesting or possibly controversial, both in Trio and in the systems he compared it to.

  • I thought it funny that one of the selling points for the Trio design is “Control-C works”! That is, if you run a Python program and try to kill it at the command line by typing Control-C, it will properly kill the program. I found that especially funny because when Sun released NFS in 1984, the initial beta implementation did not handle Control-C (SIGINT) “correctly”. To make it behave as expected, and transparently match local file IO, they needed to ensure that any IO operation on the network file system promptly returned EINTR so that the interrupt could be handled by the surrounding context and eventually kill the program. So some 34 years later, getting that cancellation behavior right is still a selling point for an async library. Cancellation is pretty hard.
  • It was especially interesting that one of the challenges for the various languages and frameworks that have tried to define consistent mechanisms is that many of the most basic system primitives do not provide the control needed to implement a rational higher-level model. This was a huge problem for Windows, since the most basic file handle and networking operations (like CreateFile) did not provide the right level of control to support flexible cancellation models. The same is true of other languages and systems. The lack of appropriate control at the lowest levels tends to propagate up, so other libraries and middleware also ignore these issues. All this makes it tremendously hard for an app developer trying to take ultimate responsibility for their program’s behavior to get it all right. The consequences play out in your daily experience with misbehaving applications.
  • Smith’s discussion of composability was important (composable behavior is hugely important for these design issues), but I think he misses grounding the design in how failures happen in the real world and the consequences of that for his design. In particular, his discussion of interval-based timeouts versus absolute deadlines misses a key point. Most user-facing applications don’t want strict deadlines; they want to continue (and quite possibly provide feedback to the user) as long as progress is being made. The distinction between “I’m making (slow) progress” and “I have no connectivity at all” makes all the difference. This means that even with a tree of timeout-limited calls, it might very well make sense to apply the same timeout to each individual call rather than slice one timeout across the calls, as a strict deadline effectively would. This behavior is actually pretty hard to implement with the mechanism Trio provides, because the desire to support both composability and information hiding (so internal implementation details are hidden from the caller) makes it difficult to reset the cancellation condition at the appropriate points.
  • I’ll also note that if you’re using this in a service or data-center environment, your model for failure, and therefore how you want to use cancellation and timeouts, will be very different. The behavior you want with a flaky cell signal is very different from the behavior you want when contacting a service sitting on the same rack in a data center.
  • He motivates using what is effectively a stack of global cancellation contexts, rather than an explicit context parameter, by pointing to how many previous systems with explicit contexts failed to propagate that context uniformly and consistently. I think this is a poor motivation. Those consistency failures are a direct result of the fact that it was generally impossible for these middleware layers to do the right thing anyway, because the underlying system primitives they would end up calling didn’t provide the appropriate control. So that failure is not really a fair test. I am a much bigger fan of both the control and the very clear semantics that an explicit parameter provides. It is too easy for code that uses an implicit global context to simply fail to consider the design issues.
  • For effectively the same reason, I did find the Trio “nursery” concept interesting. Most systems that define concurrency mechanisms essentially treat any new thread of execution as occurring at global scope. A Trio nursery requires a task to have an explicit scope. Of course it could be misused by creating all tasks within a global nursery, but overall it has a number of useful consequences that he describes in his analysis.
  • I also enjoyed the linked discussion of edge- versus level-triggered designs. I discussed this in “Defense in Depth or Hack?” in the context of event- versus state-based APIs, which is the same issue in another guise.
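To make the NFS point above concrete: the “correct” Unix behavior is that a signal such as SIGINT interrupts a blocking syscall, which fails with EINTR, and the caller decides whether to retry or let the interrupt propagate. Modern Python retries EINTR automatically (PEP 475), so the explicit loop below is shown purely as a sketch of the classic pattern; the function name is my own.

```python
import os

def read_retrying(fd, nbytes):
    # Classic EINTR retry loop: a signal (e.g. SIGINT from Control-C)
    # can interrupt a blocking read. If the signal handler didn't raise
    # an exception, retry the call; if it did (e.g. KeyboardInterrupt),
    # that exception propagates and can kill the program.
    # (Since PEP 475, Python 3.5+ performs this retry for you.)
    while True:
        try:
            return os.read(fd, nbytes)
        except InterruptedError:
            continue
```

Getting this right uniformly, across every blocking primitive, is exactly what the NFS beta initially failed to do.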
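The interval-versus-deadline distinction from the composability point can be sketched in a few lines. Both function names and the shape of the `op(budget)` callable are hypothetical, purely for illustration: an absolute deadline slices one shrinking budget across a sequence of calls, while per-call timeouts hand each call the full budget again, so work continues as long as every step makes progress.

```python
import time

def run_with_deadline(ops, total_budget):
    # Absolute-deadline style: all calls share one shrinking budget,
    # so later calls get only whatever time is left.
    deadline = time.monotonic() + total_budget
    return [op(max(0.0, deadline - time.monotonic())) for op in ops]

def run_with_per_call_timeout(ops, per_call):
    # Interval style: each call gets the full timeout again. A
    # slow-but-live connection keeps going; only a call that makes
    # no progress at all for `per_call` seconds gives up.
    return [op(per_call) for op in ops]
```

The second form is the one most user-facing applications actually want, and the one that is awkward to express when an enclosing cancel scope imposes a single deadline on the whole tree.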
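The explicit-parameter style I prefer over a stack of implicit contexts looks roughly like the following sketch. `CancelToken` and the fetch functions are hypothetical names (the same idea appears as Go’s `context.Context` or .NET’s `CancellationToken`): because the token is threaded through every signature, propagation is visible, and a layer that drops it is immediately obvious.

```python
class CancelToken:
    # Hypothetical explicit cancellation token: every layer receives it
    # as a parameter and must consult it, making the design decision
    # explicit rather than hidden in ambient global state.
    def __init__(self):
        self.cancelled = False

    def cancel(self):
        self.cancelled = True

    def check(self):
        if self.cancelled:
            raise RuntimeError("operation cancelled")

def fetch_chunk(token, n):
    token.check()  # each step observes the caller's token
    return n * 2

def fetch_all(token, chunks):
    # The token appears in the signature, so propagation is auditable.
    return [fetch_chunk(token, c) for c in chunks]
```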
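The nursery idea (a task must live inside an explicit scope, and that scope cannot exit until every task it spawned has finished) can be illustrated with a deliberately tiny toy built on threads. This is not Trio’s implementation, just a sketch of the scoping guarantee:

```python
import threading
from contextlib import contextmanager

@contextmanager
def toy_nursery():
    # Toy "nursery": the with-block refuses to exit until every task
    # spawned inside it has completed, so no task can silently outlive
    # its lexical scope the way a bare spawned thread can.
    threads = []

    def start_soon(fn, *args):
        t = threading.Thread(target=fn, args=args)
        t.start()
        threads.append(t)

    try:
        yield start_soon
    finally:
        for t in threads:
            t.join()

results = []
with toy_nursery() as start_soon:
    start_soon(results.append, 1)
    start_soon(results.append, 2)
# Reaching this line guarantees both tasks have completed.
```

Compare this with a plain "go"-style spawn, where nothing in the program structure tells you when (or whether) the background work finishes.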

If this whole topic interests you, I encourage you to dive more deeply into the links above.


Published by HackerNoon on 2018/04/30