Evaluating Dependencies — Falcor Case Study

by Terry Crowley, April 10th, 2019

A friend passed on a reference to Falcor (https://netflix.github.io/falcor/starter/what-is-falcor.html), a JavaScript framework that manages data exchange between a client application and a backend server. It was open-sourced by Netflix and is in current use as a core part of their client applications.

He passed it along because he knew my interest (or, more likely, my skepticism) would be piqued by their “One Model Everywhere” claim:

“You code the same way no matter where the data is, whether in memory on the client or over the network on the server.”

One of the deepest principles I reinforced with any developer I worked with was that the distinction between local (fast, reliable, generally synchronous) and remote (slow, error-prone, always asynchronous) was fundamental and grounded in physics. It was one critical thing that you did not want to try to abstract or hide because both the architecture of your application and the design of your user experience needed to embrace that distinction.
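To make the distinction concrete, here is a minimal TypeScript sketch in which the type signatures themselves carry it. The names `LocalModel`, `RemoteSource`, `fetchCities` and `refresh` are hypothetical and not taken from Falcor or any other library. Reads from the in-memory model are synchronous and cannot fail; the single path that crosses the network returns a `Promise` and has to decide what the user sees when the request fails.

```typescript
// Hypothetical remote source: anything that crosses the network is async and may fail.
interface RemoteSource {
  fetchCities(userId: string): Promise<string[]>;
}

// Local in-memory model: reads are synchronous and always succeed.
class LocalModel {
  private cities: string[] = [];

  // Synchronous read, cheap enough to call on every render.
  getCities(): readonly string[] {
    return this.cities;
  }

  // The only asynchronous entry point. On failure the last known data is kept,
  // so the UI can keep showing it (and perhaps flag it as stale).
  async refresh(remote: RemoteSource, userId: string): Promise<boolean> {
    try {
      this.cities = await remote.fetchCities(userId);
      return true;
    } catch {
      return false;
    }
  }
}
```

The `Promise` in the signature of `refresh` forces its caller to handle both the delay and the possible failure, while rendering code that only calls `getCities` can stay synchronous.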

I also tried to share a mental framework for analyzing new components, technologies, architectures or applications. While nothing beats practical use and measurement of a new technology to gain deep insight, you often have to evaluate technologies quickly, and you want to have a strong mental model to use for that analysis.

Let’s look at Falcor as an example of that process.

Normally of course you start by checking references — who wrote it, who’s using it, who’s supporting it? Where is it in active production use? Let’s skip that part for now although we’ll come back to it a bit later.

Let’s review Application Architecture 101. This is part of our core existing mental model we use when analyzing any new technology.

Every application has an application data model. That model might include some data that is shared across lots of users — for example a blob of data that represents the current weather information for Seattle. Then it has some data that is custom to a particular user — for example the list of cities I want to show on the front screen of my weather app. For Netflix, they have a shared catalog of videos and then per-user history, ratings and preferences.

The “golden” version of all this data lives up in the cloud. The client application requests some subset of this data to load into the local in-memory data model (and possibly caches some of it locally so it can start up more quickly next time or be robust to flaky connections).

The in-memory data model gets projected into the view or user interface. That projection might be very straightforward or quite complex. As the user takes actions, the application will update by mapping new parts of the local data model into the view or might request additional data from the cloud to bring into the local data model and then eventually into the view. Other user actions might alter the local data model and/or invoke actions that alter the user data stored in the cloud and then are reflected back to the application.

That’s about it. Virtually every application you use has that basic structure. Why am I reading this post?

Oh yeah, I left out performance and complexity. Let’s talk about those aspects as well. I’ll start with performance.

These days, the local data model can be quite large and still allow for great perceived performance. If you look at the memory usage of your Gmail browser tab or of your Microsoft Outlook application, you can see that applications can use hundreds of megabytes of memory and still provide a fast, responsive user experience. I’m not arguing you should ignore memory usage. In fact, managing memory is crucial, but relative to the other parts of the application, shrinking the core data model held in memory is usually not the first place to look to improve interactive performance. That is especially true for the real core of the application data model, which typically ends up being remarkably small compared to all the other ways we programmers find to use memory.

Mapping that data model into the view requires much more care, especially if the data model is large. Literally since the dawn of interactive applications, it has been way, way more expensive to paint the user interface than to represent the application state in memory. It still is true. An application of any complexity will virtualize the UI in some way so it only instantiates expensive user interface constructs for elements that the user can actually see on the screen.
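A minimal sketch of the arithmetic at the heart of UI virtualization, assuming every row has the same fixed height. Real documents have variable-height content, which makes the bookkeeping harder but does not change the idea. The function name `visibleRange` and the `overscan` parameter (a few extra rows rendered past each edge to hide gaps while scrolling) are illustrative choices, not any particular library's API.

```typescript
// Computes which rows are worth instantiating as real UI elements.
// Assumes fixed-height rows; returns a half-open range [start, end).
function visibleRange(
  scrollTop: number,
  viewportHeight: number,
  rowHeight: number,
  rowCount: number,
  overscan = 2,
): { start: number; end: number } {
  const firstVisible = Math.floor(scrollTop / rowHeight);
  const lastVisible = Math.ceil((scrollTop + viewportHeight) / rowHeight);
  return {
    start: Math.max(0, firstVisible - overscan),
    end: Math.min(rowCount, lastVisible + overscan),
  };
}
```

The cost of building the view now scales with the viewport rather than with the data model: for 30-pixel rows in a 600-pixel viewport scrolled to offset 3000, `visibleRange(3000, 600, 30, rowCount)` yields the range [98, 122), 24 rows, whether the model holds a thousand rows or a million.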

When we were extending the Word Web App to support long documents, the core data model for the main text of the document was typically quite small, even for long documents (especially relative to any images the document might contain). The real trick to achieving good performance was minimizing the number of HTML elements created when mapping this data model into the browser’s object model.

Updating the view quickly and efficiently in response to user actions is also key to good interactive performance and typically has tighter deadlines and more complexity than constructing the view in the first place. React, from Facebook, is an example of a highly valued framework that helps solve this very hard problem. React core doesn’t solve the virtualization part of the problem but it interacts well with third party component libraries that do.

Before we can store the data model in memory, we have to load it from the server. This is another place in our data flow where there is a mismatch between that capacious local memory and the performance of the pipe that fills it. A request can take tens, hundreds or even thousands of milliseconds to complete (or never complete if the connection fails). One thing that holds in virtually every part of the application stack is that batching requests and responses together can make a huge difference in effective performance.

This is another characteristic of application design that has been true forever. In many cases, fixed overhead and latency are a large part of the cost of processing one or many requests, so batching multiple requests can result in a multiplicative improvement in effective performance. The other key performance characteristic is that bandwidth virtually always improves faster than latency across every layer of the stack. So while the delay before the first byte of a response arrives improves only slowly (and can always vary widely), larger responses can be received relatively quickly, and the size of response that can be handled efficiently has universally improved much more rapidly than latency. This ends up being another strong argument both for batching and for making “big requests” rather than lots of small requests.
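A back-of-the-envelope cost model makes the batching argument concrete. Assume, purely for illustration, that each request pays a fixed overhead $o$ (round-trip latency plus per-request processing) and a per-item cost $c$, and that $n$ unbatched requests are issued one after another. The numbers in the second part are invented to show the shape of the result, not measurements.

```latex
T_{\text{separate}} = n\,(o + c), \qquad
T_{\text{batched}} = o + n\,c, \qquad
\frac{T_{\text{separate}}}{T_{\text{batched}}} = \frac{n\,(o + c)}{o + n\,c}

% Illustrative numbers: o = 100 ms, c = 1 ms, n = 50
T_{\text{separate}} = 50 \times 101\ \text{ms} = 5050\ \text{ms}, \qquad
T_{\text{batched}} = 100\ \text{ms} + 50 \times 1\ \text{ms} = 150\ \text{ms}, \qquad
\frac{5050}{150} \approx 33.7
```

As $c$ becomes small relative to $o$, the ratio approaches $n$, which is the multiplicative improvement described above. Because bandwidth improves faster than latency, $c$ tends to shrink over time while $o$ does not, so the advantage of batching grows rather than fading. If the separate requests run in parallel instead of one after another, the latency gap narrows, but each request still pays its own processing overhead at both ends.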

Full stack developers (known as a “developer” 30 years ago :-)) have become more sensitive to the fact that they cannot just ignore the costs of server processing as well. Any request the application makes will involve some amount of IO, memory usage, computation and communication in the cloud. That might sound trivially obvious, but it can be surprising how many client developers basically think “magic happens” at the other end of the wire. The fact that the costs are incurred far from the device the application is running on does not change the reality of the costs and the implications for application performance.

Batching helps significantly on the service side as well.

Perhaps the most effective technique for reducing service-side costs is to try to ensure that the data an application requests is already available in the minimum number of “chunks” possible (whether that chunk is a page on a storage device in a database or a block in a memory cache). This often involves duplicating data (or denormalizing in database terminology) so that the requested data can all be fetched in one low-level IO request. The end-to-end challenge is that you are now tying dynamic application behavior (what data the application wants to access together) with your state sitting in storage. Since a single service might serve multiple clients with a different updating cadence and different dynamic behavior, care needs to be taken to avoid binding the clients and the service too tightly.

This is often where friction and complexity arise, as client developers move quickly to respond to new feature requests (changing service load behavior) while service developers focus on service stability, scalability and being good stewards of the long-term data stored in the service.
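A minimal sketch of what denormalization buys and what it costs. The hypothetical `CountingStore` stands in for a storage layer in which every key lookup costs one low-level read; the video, genre and rating shapes are invented for illustration, loosely following the Netflix catalog example earlier.

```typescript
// Stand-in for a storage layer in which every key lookup costs one IO.
class CountingStore<T> {
  reads = 0;
  constructor(private readonly data: Map<string, T>) {}
  get(key: string): T | undefined {
    this.reads++;
    return this.data.get(key);
  }
}

interface Video { id: string; title: string; genreId: string }
interface Genre { name: string }
interface Rating { stars: number }
// The shape the client actually wants to render for one row.
interface VideoRow { title: string; genre: string; stars: number }

// Normalized layout: one logical row is assembled from three separate lookups.
function loadRowNormalized(
  videos: CountingStore<Video>,
  genres: CountingStore<Genre>,
  ratings: CountingStore<Rating>,
  videoId: string,
): VideoRow | undefined {
  const video = videos.get(videoId);
  if (!video) return undefined;
  const genre = genres.get(video.genreId);
  const rating = ratings.get(videoId);
  return { title: video.title, genre: genre?.name ?? "", stars: rating?.stars ?? 0 };
}

// Denormalized layout: the row is stored pre-joined, so one lookup suffices.
// The price: renaming a genre now means rewriting every row that copies its name.
function loadRowDenormalized(rows: CountingStore<VideoRow>, videoId: string): VideoRow | undefined {
  return rows.get(videoId);
}
```

Both functions return the same row, but the normalized layout costs three reads and the denormalized one costs one. The comment on `loadRowDenormalized` is the other half of the trade: the stored layout now encodes what this particular client happens to read together, which is exactly the coupling between client behavior and service storage that causes the friction just described.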

Note this whole discussion is about how the data flows from the storage or memory in the service, over the wire, into memory on the client and into the UI. Where you need to worry about latency, batching, locality, or controlling the flow has stayed remarkably stable for decades. The raw volume and the amount of processing you can do at a node in the system before or after moving the data has increased markedly everywhere in the pipeline but where you need to be careful has not changed.

These invariants are what make having a strong mental model so valuable when you do this kind of analysis.

OK, let’s get back to Falcor. You should have swept through everything before this since it was just review confirming our shared mental context, right?

Traditionally, when a component says it’s going to provide “the same programming model for local and remote”, it has meant trying to project a local API pattern onto a remote resource. This was the classic original design point for Remote Procedure Call back at Xerox PARC. In Falcor’s case, they instead expose an asynchronous (remote-like) pattern on all data, even if that data is already loaded in the local data model. Certainly a bold choice.

This is good and bad. The good part is they are not falsely hiding slow blocking behavior behind an API that looks fast and synchronous. As a practical matter, that kind of hiding is way harder to do in JavaScript than it traditionally was using native APIs. The bad part is they are now making even local data model access slower and much more expensive, both practically and in terms of the coding structure required to access it.

Falcor is a framework with “attitude”. This choice and the choice to mostly just expose the ability to set and get individual data values are examples where the Falcor designers demonstrate that their expected app behavior is one where an app “sips” at the network for data, maps that data pretty directly into the UI and generally minimizes local data context and processing. (Or at least for the application data model — the Netflix app built on top of Falcor is loading images, streaming video and in aggregate is the largest consumer of bandwidth on the Internet.) Essentially it is making the choice to ignore the fact that big requests (with good locality) might actually be relatively efficient through the stack and that local data processing can also be super efficient if care is taken in how that data is mapped into the UI.
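A minimal sketch of the cost of the asynchronous-everywhere choice, using a hypothetical `AsyncEverywhereModel` class (an illustration, not Falcor's actual API) whose `get` always returns a `Promise`, even when the value is already in the local cache. The interesting case is the second read: the data is in memory, yet the caller still has to wait for a later microtask to see it, and every function that touches the value has to become asynchronous too. The `peek` method is included only for contrast.

```typescript
// Hypothetical model in which every read is asynchronous, even a cache hit.
class AsyncEverywhereModel {
  private cache = new Map<string, string>();
  networkCalls = 0;

  constructor(private readonly fetchRemote: (key: string) => Promise<string>) {}

  async get(key: string): Promise<string> {
    const cached = this.cache.get(key);
    if (cached !== undefined) return cached; // already local, but still delivered via a Promise
    this.networkCalls++;
    const value = await this.fetchRemote(key);
    this.cache.set(key, value);
    return value;
  }

  // For contrast: a synchronous read of data that is already loaded.
  peek(key: string): string | undefined {
    return this.cache.get(key);
  }
}
```

The second `get` does not touch the network, but its result only becomes visible after the current code has finished running, so rendering code that consumes it must itself be structured asynchronously.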

Falcor does aggressively batch those individual data value requests (in many ways this is the key functional value the framework provides), but its core model of a graph makes no effort to otherwise stress locality of reference as a key application design requirement.
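The batching itself is straightforward to sketch. The hypothetical `Batcher` below (an illustration of the technique, not Falcor's implementation) collects every key requested during the current synchronous run of code, removes duplicates, and sends them in a single call to an assumed `fetchBatch` function that returns values in the same order as the keys.

```typescript
// Assumed batch endpoint: receives many keys, returns values in the same order.
type BatchFetch = (keys: string[]) => Promise<string[]>;

interface Waiter {
  resolve: (value: string) => void;
  reject: (reason: unknown) => void;
}

class Batcher {
  private pending = new Map<string, Waiter[]>();
  private flushScheduled = false;
  batchesSent = 0;

  constructor(private readonly fetchBatch: BatchFetch) {}

  // Individual value request: each caller gets its own Promise.
  get(key: string): Promise<string> {
    return new Promise<string>((resolve, reject) => {
      const waiters = this.pending.get(key);
      if (waiters) waiters.push({ resolve, reject }); // duplicate key shares one fetch
      else this.pending.set(key, [{ resolve, reject }]);
      if (!this.flushScheduled) {
        this.flushScheduled = true;
        // Flush after the current synchronous code finishes, so every get()
        // issued in the same run lands in the same batch.
        Promise.resolve().then(() => this.flush());
      }
    });
  }

  private async flush(): Promise<void> {
    const batch = this.pending;
    this.pending = new Map<string, Waiter[]>();
    this.flushScheduled = false;
    const keys = Array.from(batch.keys());
    this.batchesSent++;
    try {
      const values = await this.fetchBatch(keys);
      keys.forEach((key, i) => batch.get(key)!.forEach((w) => w.resolve(values[i])));
    } catch (err) {
      // One failed round trip fails every request that rode on it.
      batch.forEach((waiters) => waiters.forEach((w) => w.reject(err)));
    }
  }
}
```

Three `get` calls issued together, two of them for the same key, produce one round trip carrying two keys. Note what the sketch does not do: it coalesces round trips, but the keys in a batch can still land on unrelated storage locations at the service, which is the locality point made above.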

Falcor also is designed to own the local data model (“we are the ‘M’ in MVC”). When using any component, there is always the question of how tightly embedded that component becomes in your overall code base (on a continuum from isolated and replaceable to effectively requiring a rewrite to replace). A component that takes over the structure of the application model and service communication is clearly heavily embedded. That generally means that you better believe it is a great match for you now and will continue to be a great match in the future.

This part of the analysis is also where looking at how the component is used by the original authors and others is useful. For Netflix, such a strongly opinionated framework serves the additional function of enforcing standardization on their collection of applications and the individual groups responsible for them. They are using the software structure to enforce a set of guidelines and behaviors for their own development teams. Additionally, when you own both the framework and the apps, you have the flexibility to move those in parallel as necessary when requirements change. Microsoft Office would heavily use control of the libraries used across all of the Office apps to make major consistent suite-wide changes in application features and behavior. We could link major library changes with the work to adopt those changes in the applications.

A third-party user of a framework evolving in this way suffers in two ways. First, it is hard to influence the overall direction of the framework from outside. Second, when the framework does move, it tends to move in bigger jerks and shifts that either leave you on some older version that is no longer well supported or require you to invest non-trivial effort to move to a new version that might be changing in ways that are not especially critical for your development. This is not a statement about Falcor’s particular history but a recurring pattern in this type of situation.

This is a bit of a nit but I did find their documentation of the JSON Graph feature annoying in a way that is common to many frameworks. Most non-trivial data models cannot be modeled as simple trees. There are certain objects in the data model that need to be referenced from multiple places. In database terms, that is why you normalize the data model, break your application data model into multiple tables and use “foreign keys” to link between tables. There are good reasons to standardize this mechanism within the framework (not least because their decision to model the data as a single tree rather than allow multiple trees requires it). The annoying bit is that you spend all this time reading about the mechanism and have to figure out for yourself “oh, this is just your local solution to the common problem ‘X’ that everyone needs to solve”. In fact, often because ‘X’ is such a key common architectural element, it tends to accrete additional framework-specific functionality and that documentation detail further obscures that its real core purpose is just this one basic thing.
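Stripped of framework detail, the mechanism looks like the sketch below: shared entities live in one canonical place, and every other location holds a reference (a path) to them, the way a row holds a foreign key. The `{ ref: [...] }` encoding and the `resolvePath` function are hypothetical and deliberately simpler than Falcor's actual JSON Graph format; the point is only that following a reference means restarting the walk at the referenced path.

```typescript
// A reference names the path of the canonical copy of a shared entity,
// playing the role of a foreign key in a normalized database.
type Ref = { ref: string[] };
type GraphNode = string | number | Ref | { [key: string]: GraphNode };

function isRef(node: GraphNode): node is Ref {
  return typeof node === "object" && Array.isArray((node as Ref).ref);
}

// Walks `path` from the root, following references wherever they appear.
// Returns undefined for a missing path; throws if references loop.
function resolvePath(root: GraphNode, path: string[], maxHops = 32): GraphNode | undefined {
  let node: GraphNode | undefined = root;
  let rest = path.slice();
  let hops = 0;
  while (true) {
    if (node !== undefined && isRef(node)) {
      if (++hops > maxHops) throw new Error("too many reference hops (cycle?)");
      rest = node.ref.concat(rest); // continue from the referenced location
      node = root;
      continue;
    }
    if (rest.length === 0) return node;
    if (node === undefined || typeof node !== "object") return undefined;
    node = node[rest.shift()!];
  }
}
```

With a catalog in which two lists both point at `["videosById", "v1"]`, resolving `["lists", "featured", "0", "title"]` walks into the list, hits the reference, and finishes at the single canonical entry, so an update to that entry is seen through every list. That is the same job a foreign key does.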

So how does all this roll up? I’ll first reiterate that this is supposed to be an example of an analysis where you haven’t done the more time-consuming work of building real prototypes, talking with existing clients and generally pounding on the component. Those can be critical parts of the process for components you end up depending on, but you also need to be able to stage your investigation and investment, and some issues only surface over time and after significant investment. For that reason, you also want a good conceptual model of the key value the component provides, one that can be stated and evaluated fairly simply.

This component is clearly software that has been an integral part of a set of very popular applications. That is a big endorsement.

However, a number of other things would turn me off from using this particular component. Its focus on treating local data access as expensive is a technique for minimizing overall service interaction and load. That might be a good thing for the Netflix app and service model, but it is not an inherent requirement for well-behaved applications, and it runs in conflict with the needs of a large class of applications.

Treating the overall data model as a graph and focusing on retrieving individual data values provides a consistent, uniform modeling technique, but one that ignores (almost intentionally obscures) the fact that ensuring end-to-end locality, from the application model through service loads to service storage, is often the key technique for driving application performance.

The fact that the framework is heavily driven by a single set of private applications makes it somewhat risky to depend on. It means it is real production software but risks being driven by priorities that are very different from yours.

Falcor is not really solving a specific hard problem. If your application is trivial, then it is pretty trivial to roll your own. If your application is sophisticated — or gets sophisticated — then having a framework with a specific, very enforced “attitude” means you either need to be a great match for that attitude or you end up bending over backwards trying to tweak the framework into doing what you want. That probably increases dissonance with any future evolution of the framework so you are out of touch with its broad set of users — not a good place to be.

The core question I always ask is “what is the key hard problem that this component is solving for me”? In this case the right answer is probably “provide a rigid model that constrains application behavior”. That better be what you are looking for if you take on such a deep dependency.