As part of the Hippocratic Oath, doctors promise to “abstain from doing harm” to patients. Hippocrates expounds elsewhere that a doctor should “do good” and “not do harm.” As developers, we break stuff. A lot of stuff. In fact, there is an entire industry that exists solely to track the stuff we break. But is it possible that we can promise to continuously deliver web services (“do good”) without shipping show-stopping changes (“do harm”)?
The most obvious consumers to avoid harming are your paying users who consume your production environment. The intent of an eventual new or better feature is never a valid reason to break existing functionality that your customers have already paid for. A well-designed service running in a well-designed environment allows developers to add new and even drastically different features all the while providing seamless backwards compatibility.
But it’s not just production environments that benefit from such a “do no harm” mentality. Using the same design principles for internally facing services allows API developers to rapidly iterate without blocking developers on other teams.
Here are five design principles for the services in a service-oriented architecture engineered to foster rapid development without breaking existing consumers.
It should be just as easy to host and consume five versions of a service as it is to do so for only one. Detailed descriptions of Continuous Integration and Delivery are outside the scope of this article; however, once you have an automated system for deploying the first version of the service, setting up the next five or six should be trivial. Conversely, if you depend on human interaction to build, vet, and deploy your service, maintaining multiple concurrent versions will be a nightmare.
Thus, when designing CI and CD pipelines, consider how your build and deployment processes are going to handle more than one version of a service.
Changes should be able to be authored and deployed with a guarantee of zero impact on all other versions. The easiest way to make such a guarantee is to host each major version of the application as a separate executable. In doing so, deploying a new major version (or an update to a minor version) cannot possibly affect the executable bits of any of the other deployed applications.
The other (inferior) approach is to host all versions of a service in a single executable. In this case, adding or updating a major version necessitates deploying all other versions because they are all codified in a single deployed artifact. While care can be taken to prevent regressions on other versions, they are not inherently precluded.
Over time, you can deploy revisions of legacy versions that consume newer versions, yet keep the same API signatures to maintain backwards compatibility. By doing this, legacy versions eventually become thin facades over the actual business and persistence layers embodied by the more recent versions. At the same time, each major version can still be deployed (and, when needed, rolled back) independently.
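As a rough sketch of that facade idea, the snippet below keeps a v1 response shape stable while delegating the real work to v2. Every name here (`UserV2`, `get_user_v2`, `get_user_v1`, and the field names) is hypothetical and chosen only for illustration; in a real system `get_user_v2` would be a network call to the separately deployed v2 service.

```python
from dataclasses import dataclass


@dataclass
class UserV2:
    """Hypothetical shape returned by the newer (v2) service."""
    id: int
    given_name: str
    family_name: str


def get_user_v2(user_id: int) -> UserV2:
    """Stand-in for a call to the v2 service, which owns the business logic."""
    return UserV2(id=user_id, given_name="Ada", family_name="Lovelace")


def get_user_v1(user_id: int) -> dict:
    """v1 keeps its original response shape but delegates the work to v2."""
    user = get_user_v2(user_id)
    # Translate the v2 model back into the contract v1 consumers rely on.
    return {"id": user.id, "name": f"{user.given_name} {user.family_name}"}
```

Because the translation lives entirely in the v1 deployment, v2 is free to evolve without v1 consumers noticing.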
Per Semantic Versioning, a major version indicates a breaking change to existing functionality. Examples for a web service would be removing a route, adding a new required parameter, changing the required format of a parameter, or any similar change.
Browser-based consumers (such as SPAs) are more resilient to breaking changes; the user essentially reinstalls the app anyway each time a new version is available. This makes it relatively easy to deploy breaking backend changes in lockstep with corresponding frontend ones. The main caveat here is that the deployments of multiple applications MUST happen concurrently. This reduces the freedom of multiple teams to release code they control and often results in short service interruptions for consumers.
But the plot thickens with natively installed applications such as mobile or desktop apps. In this case, web service developers typically have zero control over the version that end users have installed. As a result, shipping breaking changes can cause paying customers to lose existing functionality. There are ways around this, like requiring that older clients update to a newer version. However, users generally view such changes as a critical bug in their app that they have to update to fix.
On the topic of versions, there are a few schools of thought on how a consumer should specify the version. In an HTTP request, data can be passed in the URI, the headers, or the body. But not all requests support bodies (e.g., GET), so we are left with only two options: versioning via the URI or versioning via a header. What’s the correct way? It doesn’t matter as long as you can deploy those versions independently.
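Either option can be handled by a small piece of routing logic. The sketch below resolves a major version from a request, checking the URI first and then a header. The `/v<N>/` path prefix and the `API-Version` header name are assumptions made for this example, not a standard.

```python
import re
from typing import Mapping, Optional

# Both conventions are assumptions chosen for illustration.
URI_VERSION = re.compile(r"^/v(\d+)(?=/|$)")
HEADER_NAME = "api-version"


def resolve_version(path: str, headers: Mapping[str, str]) -> Optional[int]:
    """Return the requested major version, or None if the request names none."""
    match = URI_VERSION.match(path)
    if match:
        return int(match.group(1))
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == HEADER_NAME and value.strip().isdigit():
            return int(value.strip())
    return None
```

A gateway or reverse proxy could use the resolved number to forward the request to whichever independently deployed executable hosts that major version.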
The specification for an API should be freely available to all consumers of the API. Stated differently, if you have access to the API, then you should have access to the spec. Depending on the spec NOT being available for security purposes (security by obscurity) is vulnerable to reverse-engineering and only ends up creating a barrier for legitimate developers.
In addition to being accessible, the spec should also be discoverable. Any hard work invested in defining the API is for nothing if no one knows where it is. In a previous post about documentation, I describe in a bit more detail the problem of wasting time writing docs that aren’t discoverable. The point is, make a spec, make it available along with the API, and then over-communicate where it can be found.
Before I launch into this section I want to define a few terms. I use the term “specification” (or “spec”) to refer to a rigorous, machine-readable description of an API, whereas I use the term “documentation” (or “docs”) to refer to a human-readable description of how to use the API. Both are needed, but they are, in fact, two distinct things that fulfill two distinct purposes.
If you start with machine-readable content (m) then there exists a deterministic function that produces human-readable content (h):
h = f(m)
One of the hallmarks of a mature documentation ecosystem is a plethora of tooling including utilities for rendering documentation from a machine-readable document (eg. OpenAPI, API Blueprint, RAML, etc).
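To make h = f(m) concrete, here is a minimal sketch of such a function. The `SPEC` dictionary loosely borrows the `info`, `paths`, and `parameters` layout of OpenAPI, but it is a simplified, hypothetical document, and `render_docs` is an illustrative helper rather than any real tool's API.

```python
def render_docs(spec: dict) -> str:
    """Deterministically render plain-text docs from a machine-readable spec."""
    lines = [f"{spec['info']['title']} (version {spec['info']['version']})", ""]
    # Sorting makes the output independent of dictionary ordering.
    for path in sorted(spec["paths"]):
        for method in sorted(spec["paths"][path]):
            operation = spec["paths"][path][method]
            lines.append(f"{method.upper()} {path}: {operation.get('summary', '')}")
            for param in operation.get("parameters", []):
                required = "required" if param.get("required") else "optional"
                lines.append(f"  {param['name']} ({param['in']}, {required})")
    return "\n".join(lines)


# Hypothetical spec used only for this example.
SPEC = {
    "info": {"title": "Orders API", "version": "1.2.0"},
    "paths": {
        "/orders": {
            "get": {
                "summary": "List orders",
                "parameters": [{"name": "status", "in": "query", "required": False}],
            }
        }
    },
}

print(render_docs(SPEC))
```

The same input always yields the same text, which is what makes generated docs safe to rebuild on every commit.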
However, the opposite conversion is a surprisingly difficult academic problem. There does not exist a deterministic function that produces machine-readable content (m) from primarily human-readable content (h):
m != f(h)
Natural language processing has come a long way and machine translation is an extraordinarily useful tool for working with human language; however, the technology has not (yet) progressed to the point of being able to create a rigorous and exhaustive spec from written text. Regardless of how well you generate your spec from some other source, it will always be lacking in some detail.
All of this being understood, if a tradeoff must be made between machine-readability and human-readability, opt for machine-readability. Because human-readable docs can be generated from a machine-readable spec, you end up with both anyway.
So, if humans read documentation to understand an API in order to code against it, then why make a big deal about machine-readability? In short, the answer is automation. Code generation, testing utilities, breaking change detection, service mocking, and validation frameworks are just a few of the many benefits unlocked by machine-readability.
And speaking of breaking change detection …
What’s easier than manually checking every API route on every single commit to ensure that a breaking change hasn’t entered the code base? Sadly, dealing with breakages after they have already been introduced is often easier than rigorously preventing such changes from being integrated. Without any scripted checks, putting out fires as they happen is usually the most practical solution. But with a machine-readable spec, you get the best of both worlds.
A sufficiently mature Continuous Integration pipeline should be able to prevent breaking changes from entering your code base in the first place. By completely defining your API’s functionality with a machine-readable spec, and by placing that spec alongside the code in source control, it is very easy to see how an API has evolved over time.
There are a number of tools for diffing two versions of a spec and determining if the changes are breaking or non-breaking. If you are using semantic versioning for your APIs, then that determination allows for revving the major or minor version respectively. (In the case of releasing a change that does not include a change to the spec, that would rev the patch number.)
During integration, have your CI system run a script that diffs the new version of the spec in source control with the version of the spec made available by the web service. (Remember in #2 about making the spec accessible?) Looking for breaking changes this way precludes the need for manual checks unless the build breaks at this step.
By automating breaking change detection, the only manual intervention needed should be setting up a new version of the application whenever breaking changes must be deployed.
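The CI check described above might look something like the following sketch. It only detects two of the breaking changes mentioned earlier (a removed route and a newly required parameter), and it assumes a simplified, OpenAPI-like dictionary layout. The function names are hypothetical; a real pipeline would more likely rely on an existing spec-diffing tool.

```python
from typing import List, Set


def required_params(operation: dict) -> Set[str]:
    """Names of the parameters an operation marks as required."""
    return {p["name"] for p in operation.get("parameters", []) if p.get("required")}


def find_breaking_changes(old: dict, new: dict) -> List[str]:
    """Compare the deployed spec (old) with the candidate spec (new)."""
    problems = []
    for path, old_methods in old["paths"].items():
        new_methods = new["paths"].get(path, {})
        for method, old_op in old_methods.items():
            if method not in new_methods:
                problems.append(f"removed route: {method.upper()} {path}")
                continue
            added = required_params(new_methods[method]) - required_params(old_op)
            for name in sorted(added):
                problems.append(
                    f"new required parameter '{name}' on {method.upper()} {path}"
                )
    return problems


def suggest_bump(old: dict, new: dict) -> str:
    """Map the diff onto a semantic version component."""
    if find_breaking_changes(old, new):
        return "major"
    return "minor" if old != new else "patch"
```

A CI step could fail the build whenever `suggest_bump` returns "major" for a change that was not released as a new major version.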
Thus far we have talked about predicting and mitigating the effects of breaking changes, but now let’s talk about preventing them in the first place.
A web API should be a rigorous seam between the logic it codifies and the systems that consume it. When this is the case, changes can be made to the implementation details of one system without breaking the other. But when an API leaks its own implementation details or makes assumptions about its consumers, breaking changes become very hard to avoid.
Eric Evans in his book Domain-Driven Design uses the term “ubiquitous language” in talking about the rigorous vocabulary describing a Domain Model. This vocabulary transcends the software and ought to be used by developers, stakeholders, users, and anyone else who interacts with a project. In an extraordinarily naive simplification of ubiquitous language, it is a way for all parties to work more efficiently by speaking in common terms.
With respect to API design, the language should be driven from the Domain Model itself rather than the implementation of the Domain Model. This means that while it may be tempting to define endpoints that map models from a database or third-party services, if everyone talks about the data in different terms (i.e., the ubiquitous language), those terms should be used by the API.
“Domain experts should object to terms or structures that are awkward or inadequate to convey domain understanding; developers should watch for ambiguity or inconsistency that will trip up design.” (Eric Evans)
Along those lines, I would posit that describing a Domain Model in terms of a persistence layer will inevitably end up with inconsistencies. As changes and optimizations are made to how data is persisted, reflecting those modifications in the API will eventually require breaking changes, even though the Domain Model itself has not changed.
As mentioned previously, the vocabulary used to describe data and the shape of the data itself should be based off of the Domain Model and not the backend implementation of the Domain Model. If a database table and API endpoint share the same name, that’s fine, as long as it is because both coincidentally track with the ubiquitous language.
There are legitimate cases, however, where for performance reasons, one domain object will be distributed across multiple tables. Likewise, multiple objects may share the same table. Even if these persistence implementations diverge from the common vocabulary, the API should still represent the data as the domain objects that the ubiquitous language describes.
This ensures a clean seam between an API’s implementation and that of its consumers; either party can change implementation without mandating a corresponding change from the other.
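As an illustration of that seam, the sketch below assembles one domain object from two persistence rows. The table layouts, column names, and the `Customer` model are all hypothetical; the point is only that the API-facing type uses the domain vocabulary while the storage-facing types are free to change.

```python
from dataclasses import asdict, dataclass


@dataclass
class CustomerRow:
    """Row from a hypothetical 'customers' table."""
    cust_id: int
    full_nm: str


@dataclass
class CustomerContactRow:
    """Row from a hypothetical 'customer_contacts' table."""
    cust_id: int
    email_addr: str


@dataclass
class Customer:
    """API representation, named in the ubiquitous language."""
    id: int
    name: str
    email: str


def to_customer(row: CustomerRow, contact: CustomerContactRow) -> Customer:
    """Hide the two-table split (and the column names) behind the domain model."""
    return Customer(id=row.cust_id, name=row.full_nm, email=contact.email_addr)


payload = asdict(
    to_customer(CustomerRow(1, "Ada Lovelace"), CustomerContactRow(1, "ada@example.com"))
)
```

If the two tables are later merged or renamed, only `to_customer` changes; the payload that consumers see stays the same.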
Another way to leak implementation details is to assume HOW the clients will use the API and make design decisions accordingly. This leaks consumer details into the web service.
For example, it might be tempting to aggregate multiple unrelated domain concepts into a single call because that is how the web site needs it. But those needs are primarily driven by UX concerns of a single platform. By catering to the needs of a single platform, two things generally happen.
First, performance for other systems usually takes a hit. This is either because too much unneeded data is sent across the wire or because not enough data is sent so multiple requests need to be made. Second, when the preferred platform needs an API change for UX reasons, the effect of those changes on other systems will be unpredictable and often cause bugs.
On all networks, but especially on mobile networks, the overhead of making dozens of requests causes significant performance issues. Because of this, aggregating requests is needed to improve performance, but aggregating based on a single platform’s needs is a leaky seam.
Sam Newman, in his book Building Microservices, offers the Backends For Frontends pattern as a solution to this problem. In this pattern a thin “BFF” API is created for a frontend platform with the sole purpose of facading other downstream APIs in order to tailor requests to a single frontend’s needs. In this case, the Domain (with its associated vocabulary) is the UX concerns of a frontend platform rather than the more general Domain Model implemented by the downstream services. Because of this, it is legitimate to design BFF routes around frontend implementation.
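A BFF route can be sketched in a few lines. In the example below, `fetch_profile` and `fetch_recent_orders` are stand-ins for calls to two hypothetical downstream services, and `mobile_home_screen` is an invented route shaped around what one mobile screen needs.

```python
from typing import Dict, List


def fetch_profile(user_id: int) -> Dict[str, str]:
    """Stand-in for a call to a downstream profile service."""
    return {"name": "Ada Lovelace", "email": "ada@example.com"}


def fetch_recent_orders(user_id: int) -> List[Dict[str, object]]:
    """Stand-in for a call to a downstream orders service."""
    return [
        {"id": 101, "total": "19.99"},
        {"id": 102, "total": "5.00"},
        {"id": 103, "total": "42.00"},
    ]


def mobile_home_screen(user_id: int) -> Dict[str, object]:
    """BFF route: one round trip returning only what the mobile screen shows."""
    profile = fetch_profile(user_id)
    orders = fetch_recent_orders(user_id)
    return {
        "greeting": f"Welcome back, {profile['name']}",
        # The (hypothetical) mobile design only has room for two orders.
        "recent_order_ids": [order["id"] for order in orders[:2]],
    }
```

The downstream services keep their general-purpose contracts, and the frontend-specific aggregation lives in a layer that the frontend team can change freely.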
Lastly, I want to talk briefly about a few stability patterns for keeping web services working between deployments.
The vast majority of building codes in the United States have something to do with fire safety, and much of that has to do with electrical systems. This is because electricity is dangerous. Circuit breakers in buildings exist as a point of failure that is intended to fail fast and fail first. This ensures that instead of the whole building burning down, only a switch is flipped.
Michael T. Nygard in his book Release It! describes the “circuit breaker” pattern. In distributed systems, this pattern exists to provide much of the same functionality as a physical circuit breaker. It is a feature of code that will prevent access to a “circuit” (usually another service) if it determines that that circuit is currently faulty. It allows systems to detect and then gracefully degrade when underlying services fail.
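The following is a minimal sketch of the idea, not the implementation from the book. The class name, the consecutive-failure threshold, and the cool-down period are all assumptions; production systems usually reach for an existing library with richer policies.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is refused because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of hitting a service known to be faulty.
                raise CircuitOpenError("circuit is open; skipping call")
            # Cool-down elapsed: allow one trial call (the "half-open" state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

A caller would wrap each request to the other service in `breaker.call(...)` and treat `CircuitOpenError` as the signal to degrade gracefully.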
In the context of “do no harm” this allows web services to treat both upstream and downstream systems with respect. In the case of downstream services, kicking a system while it’s down can exacerbate whatever issue is causing it to misbehave. In the case of upstream systems, returning a meaningful 503 response with a Retry-After header instead of puking a stack trace allows upstream services to degrade gracefully as well.
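For the upstream side, a meaningful 503 might be built like the sketch below. The function name, the tuple return shape, and the JSON body are assumptions for illustration; `Retry-After` itself is a standard HTTP header that accepts a delay in whole seconds.

```python
import json
from typing import Dict, Tuple


def service_unavailable(retry_after_seconds: int) -> Tuple[int, Dict[str, str], str]:
    """Build a (status, headers, body) triple for a temporarily failing dependency."""
    body = json.dumps(
        {
            "error": "service_unavailable",
            "detail": "A dependency is temporarily unavailable.",
        }
    )
    headers = {
        "Content-Type": "application/json",
        # Tell the caller when it is reasonable to try again.
        "Retry-After": str(retry_after_seconds),
    }
    return 503, headers, body
```

The body says what went wrong without exposing a stack trace, and the header gives callers enough information to back off.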
Overall, circuit breakers provide a way to keep failures from propagating across seams in a distributed system. Breaking changes are bad, but cascading their resulting failures across your entire production environment is really bad.
Nygard also suggests wrapping every call in a timeout. This is a simple way to ensure that we don’t let requests that are bound to fail clog up currently working systems. This fail-fast mentality ties in well with the circuit breaker pattern. For example, one of the heuristics that can be used to trip a circuit is repeated timeouts. In this case, waiting too long for too many requests indicates that we should just stop making calls for a minute or two. There isn’t much else to say about this other than don’t just blindly accept your framework’s default timeout duration or behavior.
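Many HTTP clients accept an explicit timeout argument, which is usually the simplest option. For calls that offer no such argument, one possible approach is sketched below using the standard library's `concurrent.futures`. The helper name `call_with_timeout` is hypothetical.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_seconds, *args):
    """Run func(*args), but stop waiting after timeout_seconds.

    Raises concurrent.futures.TimeoutError if the call takes too long.
    Note that the worker thread is not cancelled; the caller merely
    stops waiting on it.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_seconds)
    finally:
        # Do not block on a slow call while shutting the pool down.
        pool.shutdown(wait=False)
```

Repeated `TimeoutError`s from a helper like this are exactly the kind of signal that can be fed into a circuit breaker.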
Users are generally pretty good at complaining when you break stuff they paid for. If you take down a critical system, they will let you know. But being able to revert a change before the bug reports start flooding in is a fantastic way to maintain the trust of your customers.
Your logging system should be able to tell you if there is a spike in errors after a deployment. A sufficiently mature CD pipeline will be able to automatically roll back a deployment if errors jump above a certain threshold. But for the rest of us peasants, we can just manually watch the logs and trigger a rollback if needed. This means that (like your docs) your logging system should be accessible and discoverable. You should also make an effort to keep false positives low (like zero) so that you don’t have to wonder if all of those errors are “real.”
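The rollback decision itself can be quite small. The sketch below compares error rates before and after a deployment; the function name, the 5% threshold, and the minimum sample size are assumptions that would need tuning against real traffic.

```python
def should_roll_back(
    errors_before: int,
    requests_before: int,
    errors_after: int,
    requests_after: int,
    max_increase: float = 0.05,
    min_requests: int = 100,
) -> bool:
    """Return True if the error rate rose by more than max_increase."""
    if requests_after < min_requests:
        # Too little post-deployment traffic to judge; avoid false positives.
        return False
    rate_before = errors_before / requests_before if requests_before else 0.0
    rate_after = errors_after / requests_after
    return rate_after - rate_before > max_increase
```

Whether a human or a CD pipeline calls it, the counts would come from the same logging system, which is one more reason to keep that system accessible.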
Your logging system is also a fantastic way to determine what versions of what APIs are actually being used in the wild. Especially for legacy APIs, some endpoints may not be called anymore. This knowledge is useful during the sunsetting of older versions in order to slowly remove unused functionality over time. Being able to pull reports from real data is a more reliable way to do this than trying to track down each person who has written code against your service.
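Such a report can be as simple as counting log records. The sketch below assumes each record is a dictionary with hypothetical `api_version` and `route` fields; real log formats will differ.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def usage_by_version(records: Iterable[Dict[str, str]]) -> Counter:
    """Count requests per API version."""
    return Counter(record["api_version"] for record in records)


def unused_routes(
    known_routes: Iterable[Tuple[str, str]], records: Iterable[Dict[str, str]]
) -> List[Tuple[str, str]]:
    """List (version, route) pairs that never appear in the logs."""
    seen = {(record["api_version"], record["route"]) for record in records}
    return sorted(set(known_routes) - seen)


# Hypothetical log records used only for this example.
LOGS = [
    {"api_version": "v1", "route": "/orders"},
    {"api_version": "v2", "route": "/orders"},
    {"api_version": "v2", "route": "/orders"},
]
```

Routes that stay in the unused list over a long enough window are natural candidates for removal when sunsetting a version.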
Here is a short test to determine if your web services “do no harm.”
If you answered “yes” to all of them, please send me your resume! If you answered “yes” to most of them, you are doing great with one or two things to improve. If you answered “no” to most of them, you are probably spending a lot of time dealing with broken consumers. Any time spent being more “Hippocratic” will likely be a very worthwhile investment.