Rethinking MVC in Excel, Word and PowerPoint after 25 Years

Written by terrycrowley | Published 2018/04/21
Tech Story Tags: software-architecture | mvc | microsoft-office | model-view-controller | excel

I’ve talked about the challenges of building responsive applications in other blog posts (e.g. here and here). Microsoft Office took a significantly different approach for Word, PowerPoint, and Excel when building for the modern mobile platforms (WinRT, Android, iOS) than the traditional architecture used on Win32. I want to talk about what the team did and why they changed the architecture they had been using for several decades. I was leading Office development at the time and was deeply engaged in the design discussions and planning around these issues.

All well-behaved interactive applications try to stay responsive. At its most basic, “responsive” means that the application continues to receive and process messages that the system queues for it (e.g. touch events, mouse events or keyboard events). On more modern operating systems, the stakes are raised because the system will automatically (and aggressively) terminate a non-responsive application in order to protect the quality of the overall user experience on the device. (A side note: the “right” of a mobile OS to aggressively terminate an application is a significant change in the process lifetime contract compared to desktop environments. The changes required to adapt to this were also significant but are a topic for a separate post.)

“Processing messages” is a low bar. In practice, a responsive application needs to actually respond, and in a timely way. What “timely” means varies based on the type of interaction (touch, mouse, keyboard); it has been well-studied and has led to clear guidelines. That is, within some fraction of a second, the program needs to update the display in some way to indicate that it has received and processed the message. This is harder than it might seem. The requirements have become stricter on modern devices because touch-based interfaces typically have the tightest response-time requirements and because modern interfaces include animation (and touch interacting with animated interfaces). The human visual system is very sensitive to anomalies in animation, so once you start animating you want to be able to guarantee a consistent frame update rate (e.g. 60 frames per second (FPS), or about 16ms per frame).

Most graphical applications are separated into model and view. These days most models have some persistent representation in the cloud. That persistent representation might be partially cached on a local client and then that partial cache might be partially loaded into application memory. The structure in memory is tuned for generating a view that can be presented to the user quickly and then allowing the user to invoke commands to manipulate the underlying data model. Those manipulations then flow back through the cache and persist to the cloud. The changes to the data model flow forward to result in changes to the view. Ideally to the user it appears as if their actions have direct and immediate effect.

It feels slightly strange that almost four decades after writing my first graphical applications, now writing for machines with multiple processors that are each more than 1000 times faster than those early machines, I still have to worry about these basic issues. Why is this? Why haven’t these problems just disappeared due to the magic of Moore’s Law?

The simplest answer is that while processors have gotten faster, all other elements of the system have also gotten bigger, especially the size of the datasets we want to process. Some of those datasets are determined by properties of the device (e.g. depth and resolution of the display) rather than being under the control of the application developer. For most of our apps, even something like the typical size of a photo inserted into a document changes over time with no change in application features. In fact, the problem is actually even worse than this. A well-behaved application is tuned for a device that has a processor of a certain speed; memory with a certain capacity, latency, and bandwidth; storage with different capacity, latency, and bandwidth; and a network with certain characteristics. The services accessed through that network also have widely varying performance characteristics. As Moore’s Law has played out in all these areas (and in related technologies that also exhibited exponential change, like storage), the exponentials have tended to diverge, so the relationship and balance between these different components have continuously changed. Latency improvements trail bandwidth improvements universally (and exponentially) in all these components, for deep physical reasons. Capacity improvements, especially the amazing growth in storage capacity, outstrip bandwidth. It can now take literally days to read the full contents of a modern high-capacity disk drive. I was struck reading about the design of the original Alto workstation, where the IO subsystem, the memory system, and the processor were tuned so you could effectively (and relatively simply) write code that scanned the entire disk at full speed.

All this means that you need to be careful in scaling up an architecture to take advantage of new device capability. If you naively leverage that additional memory capacity (especially if you use it by filling it from more data on disk) you are going to be slower despite running on a “more powerful” machine. I mentioned above that we “tune” the application for device capability. Tuning can be like balancing a ball at the top of a hill (where small changes in capability can cause big changes in behavior) or it can be like placing it in a bowl, where the architecture tends to be robust to changing device capability. Ideally we achieve the latter, where the application is robust to changes in machine and dataset characteristics.

A key technique in being robust is to isolate behavior that is most sensitive to different and changing device capabilities. I return to my old saw, the emacs and vi repaint algorithm as a good example. In that case, the key constraint was executing screen update commands on the video display terminals of the time to reflect the new view state. The program could spend significant processing time comparing the in-memory representation of the old view state to the new desired view state to pick the best set of update commands to send to the video terminal. The time cost of that in-memory computation was completely swamped by the time spent in the video terminal actually executing the commands to update the screen. By isolating and optimizing this analysis, the behavior of the application across a wide set of device characteristics could be well-defined. In addition, the overall application structure was greatly simplified — code that modified the model did not have to worry about view update at all, secure in the knowledge that any model updates would always be reflected to the view in an optimal and timely manner.
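
To make this concrete, here is a minimal sketch of that style of repaint in C++, assuming a simple line-based screen model. The real emacs and vi algorithms also handle scrolling, cursor motion, and partial-line updates; the terminal writes are stubbed out here as prints:

```cpp
// A line-based repaint in the spirit of the emacs/vi update algorithm:
// compare the remembered screen contents against the desired contents and
// emit update commands only for lines that actually changed. The in-memory
// comparison is cheap; the terminal writes it avoids were the expensive part.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

void repaint(std::vector<std::string>& current,
             const std::vector<std::string>& desired) {
    for (std::size_t row = 0; row < desired.size(); ++row) {
        if (row >= current.size() || current[row] != desired[row]) {
            // Stand-in for "move cursor to row, rewrite line" escape codes.
            std::cout << "update row " << row << ": " << desired[row] << "\n";
        }
    }
    current = desired;  // Remember what is now on the screen.
}

int main() {
    std::vector<std::string> screen = {"hello", "world"};
    repaint(screen, {"hello", "there"});  // Only row 1 is rewritten.
}
```

Code that modifies the model never calls repaint-like logic directly; it simply changes the desired state and trusts the centralized diff to produce a near-optimal update.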

What is old is new again. React, a Facebook library for building HTML (and now native) views, follows a very similar approach. In these web interfaces, the expensive operation is updating the HTML DOM and then paying the layout and rendering costs that follow. In React, the application generates a “virtual DOM” and React then runs a diffing algorithm to determine the optimal set of update steps to propagate the new state of the virtual DOM into the real HTML DOM. Application code is simplified since you don’t have lots of custom hand-written code trying to optimize the screen update locally. There is “wasted” effort in generating parts of the virtual DOM that haven’t changed, but the key insight is that this cost is swamped by the savings in layout and rendering. The value of optimizing layout and rendering (especially preventing multiple duplicate layout and rendering passes from poorly written view update code) is easily worth any cost of reconstructing unchanged parts of the view (even for surprisingly complex views). Find what’s expensive and then structure the program to isolate and centrally control it, whether that is updating a view or issuing disk seeks. In the best cases, like these examples, you both guarantee the performance behavior of the application and can use that performance guarantee to create a simpler application architecture.
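
The same idea in miniature: the sketch below diffs two toy view trees and prints the minimal mutations a real implementation would apply to the live DOM. The Node type and patch output are invented for illustration; React's actual reconciliation is considerably more sophisticated (keys, component types, batching):

```cpp
// Toy virtual-DOM diff: rebuild a cheap description of the desired view,
// compare it to the previous description, and apply only the differences to
// the expensive real view (here the "apply" is just a print).
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Node {
    std::string tag;
    std::string text;
    std::vector<Node> children;
};

void diff(const Node& oldNode, const Node& newNode, const std::string& path) {
    if (oldNode.tag != newNode.tag) {
        // Different element type: replace the whole subtree, as React does.
        std::cout << "replace subtree at " << path << "\n";
        return;
    }
    if (oldNode.text != newNode.text)
        std::cout << "set text at " << path << " to \"" << newNode.text << "\"\n";
    std::size_t shared = std::min(oldNode.children.size(), newNode.children.size());
    for (std::size_t i = 0; i < shared; ++i)
        diff(oldNode.children[i], newNode.children[i],
             path + "/" + std::to_string(i));
    for (std::size_t i = shared; i < newNode.children.size(); ++i)
        std::cout << "insert child at " << path << "/" << i << "\n";
    for (std::size_t i = shared; i < oldNode.children.size(); ++i)
        std::cout << "remove child at " << path << "/" << i << "\n";
}

int main() {
    Node before{"div", "", {{"p", "hello", {}}}};
    Node after{"div", "", {{"p", "hi", {}}, {"p", "new", {}}}};
    diff(before, after, "root");  // set text at root/0, insert child at root/1
}
```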

OK, let’s get back to the architecture changes in Word, Excel and PowerPoint.

While the Office applications have long used multiple threads, crucially the UI thread — the thread that processes messages — has also been the “app” thread, the thread that has direct access to read and write the application data model. Office has long used a variety of techniques to keep that thread responsive. The most crucial technique is designing a data model that can efficiently interpret and apply user commands. The Office apps also chunk work (e.g. recalculating an Excel model or repaginating a Word document) into bite-sized pieces that can be executed “at idle” in between responding to these user commands. They opportunistically put long-running work that is well-isolated — from an application state perspective — on asynchronous tasks. Despite all these approaches it is still hard to guarantee responsiveness — problems leak in, both due to coding errors and due to the changing assumptions around the device capabilities and the datasets they are asked to process. For modern devices with higher touch responsiveness requirements, a better approach was needed.
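
A rough sketch of that “bite-sized work at idle” pattern, under the assumption of a single-threaded message loop: each deferred task does a bounded slice of work and reports whether more remains, and the loop only runs slices when no input is waiting. The IdleQueue type here is hypothetical, not Office's actual scheduler:

```cpp
// Sketch of chunked "at idle" work: each task does a bounded slice and
// returns true while more work remains. The message loop would call
// runSlice() only when no input messages are waiting, so a long computation
// (recalc, repagination) never holds up responsiveness for more than a slice.
#include <chrono>
#include <deque>
#include <functional>
#include <utility>

using IdleTask = std::function<bool()>;  // returns true if not yet finished

class IdleQueue {
    std::deque<IdleTask> tasks_;
public:
    void post(IdleTask task) { tasks_.push_back(std::move(task)); }

    // Run task slices until the time budget expires or all tasks finish.
    void runSlice(std::chrono::milliseconds budget) {
        const auto deadline = std::chrono::steady_clock::now() + budget;
        while (!tasks_.empty() && std::chrono::steady_clock::now() < deadline) {
            IdleTask task = std::move(tasks_.front());
            tasks_.pop_front();
            if (task())                      // Still more work to do:
                tasks_.push_back(std::move(task));  // requeue for the next slice.
        }
    }
};
```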

Key advance work was done in Office 2013 to separate the composition process into a separate thread. The compositor takes a set of graphical layers and renders them together into the screen image. The compositor is also responsible for independent animation — taking these layers and independently animating them at 60 FPS to ensure smooth movement and transitions. The key design point is that the compositor isn’t asked to do too much — it has a limited dataset to manage and it has one responsibility to ensure that the changes to the screen happen reliably at that 60 FPS.
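
As an illustration of why the compositor can make that guarantee, here is a toy version of an independently animated value: the compositor's entire per-frame obligation is to evaluate a few of these and redraw, with no excursions into the application's data model. The Animation type is invented for this sketch:

```cpp
// Toy independently animated value: evaluating it each frame needs nothing
// from the application thread, which is what makes a 60 FPS guarantee
// feasible for the compositor's small, fixed dataset.
#include <algorithm>
#include <chrono>

struct Animation {
    float from = 0, to = 0;
    std::chrono::steady_clock::time_point start;
    std::chrono::milliseconds duration{1};

    // Linear interpolation of the animated value at time `now`, clamped to
    // the endpoints once the animation has finished.
    float valueAt(std::chrono::steady_clock::time_point now) const {
        float t = std::chrono::duration<float, std::milli>(now - start).count()
                  / static_cast<float>(duration.count());
        t = std::min(std::max(t, 0.0f), 1.0f);
        return from + (to - from) * t;
    }
};
```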

Touch integration — specifically panning and zooming — has very high responsiveness requirements. A “stick to your fingers” feel requires tight integration with this compositor layer. Office worked with the Windows DirectManipulation team to design enough of a semantic model into the composition layer tree that this thread could directly implement pan and zoom feedback. Pan is simply a transform on a pre-rendered layer (although it typically requires some tiling support, so layers can be split into manageable chunks with some amount of off-screen content pre-rendered and available as it moves onto the screen). Zoom can be implemented “optically” (by stretching the bitmap rather than re-rendering the entire scene at a different zoom factor) and then asynchronously requesting and displaying new higher-fidelity layers at the new zoom factor. Critically, very effective feedback can be provided with guaranteed responsiveness based only on the dataset available on the composition thread.
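
A sketch of what compositor-side pan and optical zoom feedback might look like, assuming a pre-rendered layer with a transform. requestHiFiRender_ stands in for the asynchronous request to the app thread; it is a hypothetical callback, not a DirectManipulation API:

```cpp
// Hypothetical compositor-side pan/zoom layer. Pan adjusts a translation on
// already-rendered pixels; zoom stretches them "optically" and then asks the
// app thread (asynchronously, via the supplied callback) for crisp content
// at the new scale. Tiling and off-screen pre-rendering are omitted.
#include <functional>
#include <utility>

struct Transform { float tx = 0, ty = 0, scale = 1.0f; };

class PanZoomLayer {
    Transform xform_;
    std::function<void(float)> requestHiFiRender_;  // non-blocking app-thread request
public:
    explicit PanZoomLayer(std::function<void(float)> rerender)
        : requestHiFiRender_(std::move(rerender)) {}

    // Pan: guaranteed-latency feedback, just a translation of existing pixels.
    void onPan(float dx, float dy) { xform_.tx += dx; xform_.ty += dy; }

    // Optical zoom: stretch the current bitmap now; higher-fidelity layers
    // arrive later and replace the stretched pixels.
    void onZoom(float factor) {
        xform_.scale *= factor;
        requestHiFiRender_(xform_.scale);
    }

    const Transform& transform() const { return xform_; }
};
```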

This approach worked very well, but required significant enhancement before it could be applied universally. The Office team made the key decision, as they started on the modern applications, to separate the UI and app threads (and keep the compositor as a distinct third thread). They built on the integration approach they had used for composition and required that communication between the UI and app threads go through non-blocking asynchronous queues. They wanted blocking to be impossible by design and therefore stayed away from ad-hoc synchronization between these threads. Each application’s requirements are unique enough that the layer managing the asynchronous queues provides tool support for applications to design custom data models that synchronize through these queues.

(Note: they do have a few significant places that still drive synchronous access between the UI and App threads, most importantly in order to respond to accessibility hooks.)
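
In spirit, the queueing discipline looks something like the mailbox below: each side posts messages and returns immediately, and each thread drains its own inbox at a convenient point in its loop. This is a generic illustration, not Office's actual component; the mutex is held only for a quick swap, so neither thread can stall the other across user code:

```cpp
// Generic one-way mailbox between two threads. post() never waits on the
// receiver; drain() never waits on the sender.
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class Mailbox {
    std::mutex mutex_;
    std::vector<std::function<void()>> inbox_;
public:
    // Called from the other thread: enqueue the message and return.
    void post(std::function<void()> message) {
        std::lock_guard<std::mutex> lock(mutex_);
        inbox_.push_back(std::move(message));
    }

    // Called by the owning thread at a convenient point in its own loop:
    // take everything queued so far and run it without holding the lock.
    void drain() {
        std::vector<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(inbox_);
        }
        for (auto& message : batch) message();
    }
};
```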

Word serves as a good example of some of the challenges involved. In Word, one of the key challenges in providing immediate response to a touch action is distinguishing a touch that starts a pan from a touch that starts dragging a shape. Any request to the app thread to make that distinction could introduce unacceptable delay in initiating panning. Word therefore constructs a low-fidelity hit-testing tree that mirrors the visual tree it creates for composition. This hit-testing tree is just sufficient for distinguishing between a drag and a pan. Full hit-testing (for placing the insertion point into the flow of text or selecting content regions) reaches deep into the application data structures and happens on the application thread. This is a good example of the kinds of trade-offs in managing complexity that need to be made in the applications. Doing full hit-testing on the UI thread would have involved building up (and synchronizing) much more complex data structures on both sides of the thread boundary. This would introduce additional challenges in constraining the amount of data that needed to be managed across these boundaries while still guaranteeing responsiveness. There are still basic operations (like typing!) that have no better guarantees than in the prior design. Despite this, supporting key operations with guaranteed responsiveness makes the applications feel significantly faster and more responsive, and makes the application experience much more robust to small coding errors.
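
A low-fidelity hit-testing tree can be surprisingly small. The sketch below, with invented names, captures the essential question the UI thread has to answer immediately, namely whether a touch lands on something draggable; everything subtler stays on the app thread:

```cpp
// Invented low-fidelity hit-testing tree: rectangles mirrored from the app
// thread's visual tree, tagged with just enough semantics to answer "drag or
// pan?" immediately on the UI thread.
#include <vector>

struct Rect {
    float x = 0, y = 0, w = 0, h = 0;
    bool contains(float px, float py) const {
        return px >= x && px < x + w && py >= y && py < y + h;
    }
};

struct HitNode {
    Rect bounds;
    bool draggable = false;  // e.g. a floating shape in Word
    std::vector<HitNode> children;
};

// True: the touch lands on a draggable object, so start a drag. False: treat
// the gesture as a pan. Full hit-testing (insertion points, selection)
// still happens against the real data structures on the app thread.
bool hitsDraggable(const HitNode& node, float px, float py) {
    if (!node.bounds.contains(px, py)) return false;
    for (const HitNode& child : node.children)
        if (hitsDraggable(child, px, py)) return true;
    return node.draggable;
}
```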

There is really a hierarchy of approaches and technologies here. The first key element is to separate into threads with well-defined responsibilities. Within each thread, most code can be written to assume it runs serially (although still with significant, if somewhat looser, constraints around timeliness). To communicate across threads, they use non-blocking asynchronous queues. This leads naturally to approaches of eventual consistency between threads rather than “locking and blocking”. The asynchronous queue component layers additional tools on this approach that make it easier to follow this general strategy, but some applications have found it simpler to use the underlying technologies in a way that is consistent and parallel with this layer. For example, I described Modern Outlook’s architectural approach here; it follows the same basic model of well-defined thread responsibilities and asynchronous communication between threads. An important strategy in all these systems is having a very clear model of how asynchrony is leveraged and managed rather than allowing ad-hoc and uncontrolled asynchrony to run through the system.

As they continue to evolve these technologies and approaches, they will have to keep navigating this path of keeping the data model and processing requirements on the UI thread well-bounded in size and complexity in order to ensure the performance characteristics of the UI. This is a layer where the underlying design requirements need to be well-understood by all developers so that ongoing development does not degrade the experience by making the communication in this layer too complex. There are definitely scenarios that could tempt the team in that direction — e.g. Live Drag in Word really needs to run a full layout pass in order to give accurate feedback. It is unlikely that they would really want to try to mimic the complex layout process with a bounded dataset on the UI thread. Knowing which features (and designs) to walk away from in order to maintain overall guarantees of behavior is one of the hard choices in application development.

The work on these changes also clearly exhibited a key characteristic of good system design. The best designs include a combination of top-down factoring with a deep bottom-up understanding of the critical performance characteristics of the primitives the system is built on. You then build up the component model in a way that you can ensure those performance characteristics are robustly maintained end-to-end.

