From compilers to web apps, databases to caches, maintaining computed state is the most prevalent and thorny problem I’ve faced as a developer. Most of this article is about that problem, but skip below to “A Tiny Contribution” if you just want to see some code.
Many of us developers are familiar with this famous quote from Martin Fowler:
There are only two hard things in Computer Science: cache invalidation and naming things.
Cache invalidation is an incredibly hard thing to get right, to be sure; but that’s not the point Fowler is making. “cache invalidation” is amusingly over-specific: the quote challenges you to understand why it’s so hard.
Cache invalidation is hard because managing computed state is hard.
I certainly am not the first person to talk about this problem, but I had a lot of trouble finding someone talking about this problem in the general case and giving it a name. So, for the moment, I am calling this the “model-view problem.” You may be familiar with the terminology “model” and “view” from various UI frameworks; I am using these words in an analogous way, but as you will see in the following examples, I employ them here with a more general intent.
The Model-View Problem
Starting on the left side of this diagram, we begin life with a modest model, M, and a view, V, which is derived from M with a view function.
Then, something happens: M changes and we are left with a new model, M′. Here, we find ourselves at a critical juncture. How should we determine the new view, V′? We have two possible paths:
From a code maintenance standpoint, we’d like to only define the view function. However, from an efficiency standpoint, we’d prefer to somehow derive the change in the view from the change in the model. This tradeoff is the crux of the model-view problem.
I assert that the model-view problem is one of the most common and difficult issues in development today. Let’s look at some examples.
In the old days of Javascript, we had to evolve both our objects and our DOM elements in parallel: we were stuck in adventure mode, whether we liked it or not. We were mostly right to dislike it, and next generation Javascript frameworks are all about making the view function more explicit, while retaining as much performance as possible.
Underneath our fancy web app, our browser has the same problem — given this change in the DOM, how shall I update the screen? SVG renderers have this problem in spades. But this problem isn’t limited to the web.
Desktop OS: When an application updates its window, do I need to update the screen? Which parts?
Photoshop: Advanced image editors have to efficiently update the final product, which is composites several source layers.
Games: What parts of the scene can be excluded from rendering, based on how the user is facing? What can I pre-compute, and how do I know when to update that computation?
Productivity Apps: Whether you’re working with a letter, spreadsheet, or slide deck, the hard parts about these apps all boil down to inferring a change in the final product based on changes to the source data.
Segal’s law is an adage that states: “A man with a watch knows what time it is. A man with two watches is never sure.” Keeping multiple copies of your data is dangerous.
Caching: The cache invalidation solution is a bit of a hybrid path: it is all about figuring out what parts of the replica need to be refreshed from the model, but selecting them by looking at the state change.
Storing state in the client: Nothing sinks your shiny, new mobile app faster than bolting on “offline support.”
Database indexes: Just be thankful the database is implemented for you; it’s got to correctly implement a cascade of updates stemming from foreign keys and secondary indices on arbitrary changes.
Denormalizations: This is what you wish the database had implemented for you. Incrementally updating denormalizations is very hard to get right.
A fatter, uglier cousin of “replicated state” is “computed state.” Here we have perhaps the most common and damaging incarnation of the model-view problem.
I am using “Computed Data” in this diagram to refer to aggregations and any other data you put in your database that could have been inferred from other data in your database.
If you’re on the high road, you’re probably running massive “batch jobs” nightly, and hoping that your boss is old enough to think that sounds reasonable. If you’re in adventure mode, you’re probably explaining to your boss how FudgeEater666’s notification count might be permanently wrong because the local network dropped out at the wrong time.
We are victims of the model-view problem as end users, as well.
When it comes to builds and tests, nearly everyone takes the high road. However, modern build systems are getting increasingly granular, sandbagging against the tide of preprocessors and packagers.
This is one example where, if we could have the benefits of both paths, our lives would all be much better. Too long have we waited for the compilers and tests to run nearly identical workloads over and over again, in a special purgatory we share with our CPUs.
Eight years ago, Damien Katz had a solution for the model-view problem in CouchDB; it was brilliantly simple, although not completely general. The core idea is that if you can express your view as a reduce function, then you can store intermediate results of the reduce in each node of the B-Tree on disk. Then, when a change happens, you can run reduce on the nodes of the tree that were affected by the update. Here is a visual example of a simple view that uses reduce to produce a sum of numbers:
This way, we get the elegance of the high road with the performance of adventure mode (admittedly, at the cost of storing a lot of intermediate values). Many more databases should be adopting a solution like this; it’s a simple, beautiful solution to an important problem.
Of course, the model-view problem isn’t limited to databases. To help out with some of my recent optimization work, I implemented a simple in-memory version of this algorithm in python: it works just like an (immutable) list, but updates a view based on a reduce function that you specify. Starting today, you can start making incremental map-reduces like so:
$ pip install ScenicOverlook$ python>>> from scenicoverlook import viewablelist>>> numbers = viewablelist([5, 10, 5, 0])>>> # Incrementally maintains the sum of a list:>>> numbers.reduce(lambda x, y: x + y)20>>> numbers = numbers[1:]>>> numbers.reduce(lambda x, y: x + y)15>>> numbers = numbers + [10, 5]>>> numbers.reduce(lambda x, y: x + y)30
See the pydocs for more examples. These are very simple examples, but I believe that many of my problem cases above could, in theory, be expressed as a graph of more complex map-reduce operations.
First, I think it’s valuable to have a name for this concept. These days, when I recognize new instances of this problem, I try not to rush the solution. I compare it to other instances of the problem and think about what approaches work well in those cases. Perhaps you’ll find this reflection useful as well.
Second, we need more mental energy applied to the model-view problem. Incremental map-reduce should exist in more databases and in more languages. But incremental map-reduce is not the only answer: there should be many more options, each with different characteristics. Today, “solving” the model-view problem just means averting disaster; ideally it would be more like picking which data structure to use.