

Backend engineers canโt write elegant and performant application-layer code for many types of everyday business logic. Thatโs because no one has yet invented a โdenormalization engineโ, a database with a more general kind of indexer.
To see what I mean, letโs dive into an example application and talk about its data architecture: how it works today vs. how weโd ideally want it to work.
Say we want to build a messaging application with various conversation โroomsโ. Basically a clone of Facebook Messenger, which has one room per person youโre chatting with:
Notice that in this screenshot, there are two unread rooms. Weโll define an unread room (with respect to the current logged-in user) as โa room whose last messageโs timestamp is greater than the timestamp when the current user last viewed itโ.
Naturally, we want to be able to render a UI like this:
How do we architect the calculation of the number of unread rooms? It seems like a straightforward problem, an example of what I mean by โeveryday business logicโ.
Weโll consider two architectural approaches for counting unread rooms: the normalized approach and the denormalized approach.
When building an application, thereโs always a core schema of data types and fields that we absolutely need in our database. Thatโs the normalized schema for our application.
Designing a normalized schema is largely product-definition work, not engineering implementation work. Thatโs because the set of possible values for the data in a normalized schema corresponds to the set of possible states that your application can meaningfully be in. Good programmers put a lot of thought into designing their applicationโs normalized schema.
In our messaging appโs normalized schema, there has to be a User type (a.k.a. User table, User collection) and a Room type. There also has to be a Message type, which looks like what youโd expect:
And there has to be a RoomUser type, which stores facts about <roomId, userId> pairs, such as the last time user โliron-shapiraโ (yours truly) saw the inside of room โr31โ (conversation with Sasha Rosse):
Notice that the aforementioned timestamp, 1310 (donโt try to make sense of these fake timestamp numbers), is before Sasha Rosse sent me a message containing the content โPerfect!โ This is how the application knows that room r31 is an unread room, and how the UI decides to bold its fonts here:
Similarly, my user saw room r20 at timestamp 1308, which is before Mac Tyler sent me the messages โI have a great ideaโฆโ and โLetโs go see Giraffage tonight!โ So r20 is a second unread room (with respect to my user):
The other roomsโ RoomUser objects arenโt shown, but we can deduce from their rendered UIโs lack of bold fonts that their corresponding RoomUser.seenTimestamp values are greater than or equal to the greatest timestamp of their contained messages:
The core schema we just went overโโโwhose types include User, Room, Message and RoomUserโโโis completely normalized, meaning none of the fields are logically redundant with any of the other fields. I define the normalized approach to mean forcing yourself to survive on a completely normalized schema.
In the normalized approach, the number of unread rooms isnโt explicitly stored anywhere in our databaseโrather, itโs logically implied by the data in our normalized schema. So how do we ultimately make this โ2โ appear?
The normalized approachโs answer is that weโll get it from a query, made-to-order at the time we need it. Sounds reasonable, right? Just a little olโ query. Like this:
HOLY MOTHER, that query has quite a few moving parts, including scans and joins on multiple data types. I wrote it as a ReQL query, but regardless of which database youโre using, its performance is going to be unacceptable at scale. (Ok database ninjas, Iโm sure you have a way to configure your query engine to handle this beast, but admit I have a point here.)
And thatโs the characteristic problem with the normalized approach: In exchange for the simplicity of working exclusively with normalized data, you have to write queries that donโt scale.
The only way we can get efficient queries is to give up on having a completely normalized data model, and add some logically-redundant denormalized fields to our schema.
For our messaging app, letโs add a denormalized field called User.numUnreadRooms. Then a user object will look like this:
This kind of schema is the most efficient thing you could ever hope to query when your frontend needs to show the number of unread rooms:
Query efficiency is a nice-to-have, but if weโre only architecting our messaging app for up to 100,000 users, we can afford to live without it. On the other hand, if weโre architecting for 10M+ users, itโs a must-have. For serving up unread-message counts at scale, the denormalized approach is the only approach.
Letโs keep going with the denormalized approach for our messaging app. So far, all weโve done is add a field to our schema definition called User.numUnreadRooms. How do we get a value of 2 in there?
The answer is, we need to write an โupdater queryโ, like this:
r.table('user').get('liron-shapira').update({numUnreadRooms: x})
Thatโs part of the answer, but we still have two questions:
What are all the events in our messaging application for which it might be necessary to update a value of some User.numUnreadRooms field?
Imagine youโre really making this messaging application and try to come up with the answer. Then scroll downโฆ
.
.
.
How about when a new message is sent, and we insert a new Message object into our database? Yep.
How about when a user opens up a room and views its latest messages, thereby updating their RoomUser.seenTimestamp? Yep.
Is that it? Scroll downโฆ
.
.
.
The answer is: there are more situations to think about.
Did you think about what happens if a user leaves or deletes a room, thereby deleting a RoomUser record? (There are other equally valid implementation choices for deleting a room, like setting a Room.isDeleted flag, but Iโm picking something compatible with the query-beast from the โnormalized approachโ section.)
A user who leaves or deletes a room can potentially require you to subtract one from their numUnreadRooms!
How about if you want your messaging application to support deleting individual messages, like Slack does?
If the message being deleted is the only one in the room that a user hasnโt seen yet, then it will need to decrement that userโs numUnreadRooms.
It turns out that quite a few events are tightly coupled with the User.numUnreadRooms field. Thereโs even one I havenโt mentioned yet. Do you see the potential for buggy code?
On the day youโre implementing the User.numUnreadRooms field, youโll rack your brain for all the places in your code that will need to update it, and I donโt blame you for overlooking the onDeleteRoom event handler.
On the day youโre implementing the onDeleteRoom event handler, youโll rack your brain for all the denormalized fields that it might need to update, and I donโt blame you for overlooking the User.numUnreadRooms field.
I donโt blame you, the application developer, for any bug in your denormalization logic. I blame the paradigm of tightly coupling denormalization logic with application code.
Iโll talk more about the paradigm shift we need, but first Iโd like to keep exploring all the issues with denormalization, and piling on about how annoying they are. And keep in mind, this isnโt supposed to be rocket science. Itโs just a messaging application.
Say weโre writing some event-handler code for when a user posts a new message, or sees the inside of a room, or deletes a room, or something like that. Recall that the updater query for my liron-shapira user looks something like this:
r.table('user').get('liron-shapira').update({numUnreadRooms: x})
There are many ways we can make this updater-query work. The most important distinction is whether we decide to do a pure or an incremental recomputation.
If we compute our denormalized User.numUnreadRooms field as a pure function of some of our normalized fields, thatโs called a pure recomputation.
Conveniently, the beast-query we were talking about earlier is a pure function of some of our normalized fields. All we have to do is substitute it for x in our updater expression, and now we have an example of pure recomputation:
The nice thing about a pure recomputation is that the new value of User.numUnreadRooms doesnโt depend on the previous values of your denormalized data. If the previous value of User.numUnreadRooms had become corrupted to any extent, or been outright deleted, thatโs not a problem.
This is a useful property because itโs pretty common for denormalized fields to get corrupted. You might have a bug in your code which previously calculated the wrong value and stored it. Or the non-atomic semantics of your updater query might occasionally make your value off by one or two (more about atomicity at the end of this post).
Letโs say user sasha-rosse posts a new message into room r31. We know our handler for the โnew messageโ event might need to update my userโs numUnreadRooms, but weโd like to avoid a pure recomputation.
With a pure recomputation, computing resources are wasted scanning indiscriminately through all the rooms Iโm in and their message lists, because we know that only one roomโs message list is currently being updated. Furthermore, we know that the only possible consequence of a โnew messageโ event for my userโs numUnreadRooms value is either an increment by 1, or no change. We can use this insight to increase the efficiency of our updater logic:
We call this an incremental recomputation of User.numUnreadRooms because itโs not pureโโโit doesnโt calculate numUnreadRooms as a pure function of other fieldsโ values. Rather, it introduces a dependency on the previous value of User.numUnreadRooms.
Itโs a bit concerning that if our numUnreadRooms value ever gets corrupted or deleted, then no amount of these incremental recomputations will ever get it back to the correct value. Weโve made our denormalized data more brittle. But on the plus side, we got a major performance gain in our updater logic by using an incremental recomputation.
Ok but we didnโt just make our data more brittle; we also made our application logic more brittle. It was hard enough for us to track which event handlers need to run updates on User.numUnreadRooms. Now we also have to invent separate queries for each event handler, with separate clever performance optimizations:
Furthermore, whenever we modify our applicationโs behavior in any way, we have to reconsider the logic behind each of our incremental recomputations, and whether it still holds sound.
For example, if we introduce a โmark as unreadโ feature:
Our event handler for โmark as unreadโ will presumably have its own logic for incrementally recomputing User.numUnreadRooms value, so thatโs not the issue. The issue is, do we know whether introducing this new feature will break the incremental-recomputation logic we wrote for the โnew messageโ event handler?
The answer is yes and no:
When you modify your normalized schema by adding a field like RoomUser.isMarkedAsUnread, it makes sense that you should have to reconsider the update logic for the User.numUnreadRooms fieldโโโbut only in one place (like the one beast query in the full-recomputation approach), not throughout your codebase in every event-handlerโs logic.
Personally, I bite the bullet and take the denormalization approach in order to make my apps scalable. I hate having to scatter recomputation triggers in an ad-hoc fashion throughout my event handlers, but I donโt see an alternative. Itโs a dangerous monster that I have to keep a close eye on.
In order to keep the monster from breaking out of its cage, I stick with pure recomputations wherever I can. If you want to get fancy with incremental recomputation, or you have to because performance is unacceptable, then fineโโโbut you better have some solid tests.
In the last two sections, we saw two reasons why application-layer denormalization is a struggle: weโve had to struggle to achieve correctness, and struggle again to achieve each incremental improvement in efficiency. After all this struggle, can we at least expect that our efforts will yield good code?
No. Itโs the opposite: we can expect that our efforts will yield clunky code.
A good codebase is one that makes you, the programmer, nimble. Such a codebase minimizes your workload in the face of ever-changing (1) behavioral requirements and (2) schema field definitions.
The normalized approach, for all its performance flaws, at least lets us take our beast of a query and wrap it in a functionโโโsomething like getNumUnreadRooms(userId). That function can be referenced from anywhere in our codebase, so any change to the schema field definition of User.numUnreadRooms requires us to change our code in only one place. With the normalized approach, we are nimble.
But with the denormalized approach, when we change anything about our behavioral requirements or our schema field definition, it breaks our painstakingly-reasoned and painstakingly-optimized code. To fix it, we have to go through our event handlers one by one and re-reason about their correctness and re-optimize their behavior. Thatโs clunky code.
If your boss asked you to submit a โschema designโ for a messaging app like this, what kinds of information would you include? Probably these three:
Think about what #3 means. An index definition is a declarative expression that tells a database to create an auxiliary data structure and keep it in sync with one or more of your tables. Itโs typically just a few lines of code, but it has a huge impact: it tells your database to crank out nonstop inserts, updates, deletesโโโany operations necessary to maintain a certain invariant relationship between your tables and that auxiliary data structure.
See where Iโm going with this? An indexed field is a special case of a denormalized field! Therefore I propose we give your boss an additional piece of information as part of your schema design submission:
4. Definitions of denormalized fields
If youโre not convinced yet, look what happens when you only write out schema design parts #1 and #2 for the User table (thereโs no #3 in this case because weโre not interested in building an index on any User fields):
Is this really worthy of being called a โschema designโ? Surely not, because weโre omitting the fact that User.numUnreadRooms is a denormalized field. Weโre making it look like an independent degree of freedom that our application state has, which is wrong. So we need to include #4 in our schema design: we need to indicate that User.numUnreadRooms is a denormalized field.
Furthermore, we need to explicitly define an expression that shows how its value is logically determined. What will the actual syntax look like for such an expression? Iโm not sure yet. Iโm going to think about it more. All I know is that some kind of declarative denormalization syntax is bound to be invented soon, for the same reason index-definition syntax was invented a few decades ago.
Itโs obvious to todayโs programmers that index definitions should be part of schema design. But at one point it was a paradigm shift. If our databases keep pushing the expressive power of their schema definition languages, we can anticipate that there will be another paradigm shift for database-backed application programming.
In software engineering, a paradigm shift is when something which is bad to think about becomes impossible to think about.
Right now, working with denormalized data is bad. If youโve ever written anything less than the perfect denormalized-field recomputation code, then running it will have introduced data corruptions. Just like that, youโve silently nuked the logical invariant that you were hoping to maintain on your data set. Donโt you wish that were impossible?
Our current approach to denormalization has a paradigm-shaped hole. We can see the outline of a better paradigm ahead; that paradigm just hasnโt been invented yet.
Iโm coining the term denormalization engine to refer to a โgeneralized indexerโ: a database engine (or a layer on top) that can process denormalized-field definitions in much the same way as a database can process index definitions.
In our messaging app, User.numUnreadRooms should be a declarative expression in some kind of schema-definition file. Weโd like a denormalization engine to understand our definition of that field and always keep it up-to-date for us.
Many databases have taken a few small steps toward being denormalization engines, mostly via their implementations of materialized views (a.k.a. computed indexes).
A materialized view is basically a set of denormalized fields. Many databases force the programmer to manually invoke the recalculation of the values in a denormalized view, but those ones arenโt interesting to talk about here. The interesting ones have either or both of these features:
The logic of how to infer incremental updates from the databaseโs log, and the logic of detecting when a recompute is necessary, is nontrivial. So these features represent legitimate steps toward denormalization engines.
Unfortunately, the steps are quite small. The databases that support #1 or #2 donโt claim to do so for arbitrary denormalization relationships. They come with massive restrictions, usually in the form of a whitelist of composable operations.
Oracleโs materialized views seem to offer some of the most advanced functionality out there, but are still very limited. Thereโs no database in the market today that can implement our desired definition of User.numUnreadFields via a materialized view, as far as I know.
UPDATE: Thanks to Connor McDonaldโs reply and others, it seems that Oracleโs materialized views are powerful enough to implement this User.numUnreadFields query.
UPDATE #2: But thereโs no database in which it would be practical to have a bunch of material views with arbitrarily-complex queries, and chain such views together in a graph structure.
Building a denormalization engine will be a big technical challenge for the database builders out there. In fact, I used to feel that it was unrealistic to expect that anyone would build a denormalization engine any time soonโฆ until I realized that weโre all currently acting as our own sloppy application-specific denormalization engines, resulting in bugs and clunky code. Thereโs got to be a way to improve things a little bit beyond the status quo.
Weโve been talking as if denormalized data is a pure function of normalized data. While thatโs technically true, itโs often more natural to describe denormalized data as a pure function of other denormalized data.
In our messaging app, it makes sense to define a second denormalized field called Room.lastMessageTimestamp using the following pure function:
Once we have that, we can use it as a building block for a more efficiently-recomputable definition of User.numUnreadRooms:
Once you define denormalized fields in terms of other denormalized fields, you might get a pretty rich directed acyclic graph (DAG) structure which I call a denormalization graph.
When someone invents a way to declare denormalized fields as part of a formal schema definition, all of those field definitions together with your normalized fields should be visualizable on a single denormalization graph.
Event sourcing means recording all updates to your database with a series of event objects, and being able to recompute all your tables/collections/views as a pure function of your event objects.
(Technically, any database with a log is an example of event sourcing, because we can consider every log entry as an event object. Event sourcing gets more interesting when all your event objects have application-level meaning.)
I just want to put event sourcing on the conceptual radar because if we take the idea of denormalization graphs to the logical extreme, then the only data that can be said to be โnormalized dataโ is precisely that which the event sourcing crowd would call โevent objectsโ.
Event sourcing has a ton of great benefits. This video does a good job explaining some of them. I think the reason event sourcing hasnโt caught on is because thereโs no good denormalization engine that can transform our stream of event objects into the schemas we usually consider to be โnormalizedโ.
Many databases donโt support atomic operations, a.k.a. transactions, across different data types. Itโs hard to make those scale. Thatโs surely one of the challenges of implementing a denormalization engine. And thatโs why, sadly, none of the updater queries given in this post are atomic.
But this whole post is about a problem thatโs strictly easier than atomicity: how to write elegant code thatโs correct (ignoring atomicity-violation-caused incorrectness), and reasonably performant, and nimble to change.
I mentioned that I think the best way to operate in todayโs pre-denormalization-engine era is to take the denormalized approach, and then stick to pure recomputations only. A pure recomputation is an idempotent operation, meaning you can do it arbitrarily many times without breaking anything. So I figure, if 0.1% of the time the value of some denormalized field is off by one or two due to the lack of atomicity, weโll just make sure it gets recalculated again sometime soon and the error will go away.
I would love to have a discussion about atomicity guarantees in light of our new shared language about normalized vs. denormalized fields, full vs. partial recomputation, denormalization graphs, etc. But I feel that talking about these things now would be analogous to talking about what HTTP should look like before weโd figured out TCP. The discussion would tend to keep veering into details about how to handle dropped packets, and packets arriving out of order. Letโs focus on building one abstraction layer at a time.
Itโs a joy to be able to build full-stack applications from zero to 1.0 in a few days. I personally feel comfortable with 90% of the work required to do so, but data denormalization is still in that last 10%ย .
I think the database, or the โdata layerโ, is overdue for a paradigm shift in the way it handles denormalization. I predict that powerful denormalization engines will be invented in the next few years.
In the meantime, Iโm planning to give this topic more thought and maybe even hack together a simple proof-of-concept denormalization engine. If you think youโd be interested to throw around ideas or work together, letโs chat.
UPDATE Dec 2016:
Iโve come to think that the framework-level solution to data denormalization has to look something like logic programming. Thatโs what got me excited about a new programming language called Eve. I just wrote this VI-part post series about Eve, and parts II and VI are basically followups to this post.
Create your free account to unlock your custom reading experience.