When I co-founded my last company, one of our biggest challenges was delivering an API to our customers that would stay up 100% of the time. Our product was a content management system and an API serving millions of requests per month that our customers used to integrate content into their websites. After multiple outages that nearly crippled our business, we became obsessed with eliminating single points of failure.
Our customers built websites that made an API request to us for page content during their request/response lifecycle. This meant that if their API request failed, their page likely wouldn’t render. In other words: if our API went down, our customers’ websites went down with us.
This is a lesson we learned the hard way in our early days. Unreliable server hosting led to frequent intermittent outages and performance degradations that frustrated customers. A botched DNS migration caused API downtime that took dozens of customers’ websites down for nearly half a day and left many customers questioning whether they could continue relying on us (a handful of them left).
After this incident, we recognized that ensuring near-100% uptime was an existential issue. A significant outage in the future could lead to us losing hard-earned customers and put our business in crisis.
Avoiding failure completely is not possible; you can only do your best to reduce your chances.
For example, “controlling your own fate” by running your own physical servers protects you against your hosting provider going down, but puts you in the position of having to handle security and scalability, both of which can easily take you down and be difficult to recover from.
For us, keeping our API up at all times and making sure it delivered high performance across the globe was crucial. But as a smaller company, we knew we didn’t have the resources to deliver global, highly scalable performance with near-100% uptime. So we turned to someone that did: Fastly.
Fastly describes itself as an “edge cloud platform that powers fast, secure, and scalable digital experiences for the world’s most popular businesses.” Its customers include the New York Times, BuzzFeed, Pinterest, and New Relic. We put Fastly in front of our API as a cache layer so that all API requests were served via their CDN.
When one of our customers updated their website content in our backend, we invalidated the cache keys for the specific bits of content that were edited. Non-cached requests hit our servers, but we had a ~94% hit rate because content on our customers’ websites changed infrequently relative to the number of visitors they had. This meant that even if our database or servers experienced intermittent outages, our API remained up. We wouldn’t want this, but theoretically, our servers could have gone down completely for several hours and our customers’ websites would have stayed up.
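Targeted invalidation like this is typically done by tagging cached responses with surrogate keys and purging a key when its content changes. A minimal sketch using Fastly’s purge-by-surrogate-key endpoint (the service ID, key naming scheme, and helper functions here are illustrative assumptions, not our actual setup):

```python
import urllib.request

FASTLY_API = "https://api.fastly.com"

def build_purge_request(service_id: str, api_key: str,
                        surrogate_key: str) -> urllib.request.Request:
    # Fastly purges all cached responses tagged with a surrogate key via
    # POST /service/{service_id}/purge/{surrogate_key}
    url = f"{FASTLY_API}/service/{service_id}/purge/{surrogate_key}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Fastly-Key": api_key},  # API token for authentication
    )

def purge_content(service_id: str, api_key: str, content_id: str) -> None:
    # Hypothetical key scheme: each piece of content is tagged "content-{id}"
    req = build_purge_request(service_id, api_key, f"content-{content_id}")
    with urllib.request.urlopen(req) as resp:
        resp.read()
```

The origin would attach the matching `Surrogate-Key` response header when serving each piece of content, so one purge call evicts every cached URL that included it.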
During the early days of our company, we dealt with two separate DNS incidents that left us scarred. In the first incident, our DNS provider at the time accidentally “cancelled” our account from their system, leading to an outage that took nearly 6 hours for us to fully recover from. Our second incident occurred when routine DNS editing led to a malfunction by our [different] DNS provider, and took nearly half a day to resolve. DNS incidents are particularly damaging because even after an issue is identified and fixed, you have to wait for various DNS servers and ISPs to clear their caches before customers see the fix on their end (DNS servers also tend to ignore your TTL setting and impose their own policy).
Our experiences made us extremely focused on eliminating any single point of failure across our architecture.
For DNS, we switched to using multiple nameservers from different DNS providers. DNS providers often allow and encourage you to use 4–6 redundant nameservers (e.g., ns1.example.com, ns2.example.com). This is great: if one fails, requests will still be resolved by the others. But since all of your nameservers come from a single company, you’re placing a lot of faith in that one provider having 100% uptime.
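Cross-provider redundancy just means registering NS records that point at two independent companies’ nameservers. A zone fragment might look like this (domains and provider names are made up for illustration):

```
; NS records split across two independent DNS providers, so the zone
; still resolves even if one provider has a total outage.
example.com.    86400    IN    NS    ns1.dnsprovider-a.com.
example.com.    86400    IN    NS    ns2.dnsprovider-a.com.
example.com.    86400    IN    NS    ns1.dnsprovider-b.net.
example.com.    86400    IN    NS    ns2.dnsprovider-b.net.
```

The catch is that both providers must serve identical zone data, so you either keep them in sync manually or use providers that support zone transfers (AXFR) or a common API.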
For our application servers, we used Heroku’s monitoring and auto-scaling tools to make sure our performance didn’t degrade from spikes in traffic (or if Fastly went down and we suddenly needed to route all requests directly to our servers). In addition to caching our API with Fastly, we also cached our API at the application level using Memcached. This provided an additional layer of resiliency against intermittent database or server failure.
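An application-level cache like this is usually wired up as a read-through layer: check Memcached first, and only fall through to the database on a miss. A minimal sketch (the client interface assumed here matches common Memcached libraries such as pymemcache, but the class and fake client are illustrative):

```python
import json
from typing import Any, Callable

class ReadThroughCache:
    """Read-through cache in front of a slower data source.

    `client` is any object with get(key) -> bytes | None and
    set(key, value, expire) methods -- the interface exposed by
    typical Memcached clients (an assumption, not a specific library).
    """

    def __init__(self, client: Any, ttl: int = 300):
        self.client = client
        self.ttl = ttl

    def get(self, key: str, load: Callable[[], Any]) -> Any:
        cached = self.client.get(key)
        if cached is not None:
            return json.loads(cached)       # cache hit: skip the database
        value = load()                      # cache miss: hit the database
        self.client.set(key, json.dumps(value).encode(), self.ttl)
        return value

class FakeClient:
    """In-memory stand-in for a Memcached client, for demonstration."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value, expire):
        self.store[key] = value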
To protect against the rare possibility of a total outage across Heroku or AWS (which Heroku runs on), we maintained a separate server and database instance running on Google Cloud that we could fail over to quickly.
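The simplest form of this kind of failover is to try the primary endpoint first and fall back to the standby on error. A sketch (the endpoint URLs are hypothetical, and `fetch` is injected so the routing logic stays independent of any HTTP library):

```python
from typing import Callable, List

def fetch_with_failover(path: str, endpoints: List[str],
                        fetch: Callable[[str], bytes]) -> bytes:
    """Try each endpoint in order; return the first successful response.

    `endpoints` lists base URLs from most to least preferred, e.g. the
    primary Heroku stack first, then the Google Cloud standby.
    """
    last_error = None
    for base in endpoints:
        try:
            return fetch(base + path)
        except OSError as e:        # connection refused, timeout, DNS failure...
            last_error = e          # fall through to the next endpoint
    raise last_error
```

In production the switch would more likely happen at the DNS or load-balancer level, but the same ordering logic applies: health-check the primary, and route to the standby only when it fails.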
No matter how reliable our API was, we had to accept that networks are unreliable and failures are bound to occur. We’ve all experienced trouble connecting to Wi-Fi, or had a phone call drop on us abruptly. Outages, routing problems, and other intermittent failures may be statistically rare, but at scale they happen constantly, at some ambient background rate.
To overcome this sort of inherently unreliable environment, we helped our customers build applications that would be robust in the event of failure. We built SDKs that included features such as automatically retrying failed API requests, or supporting a local backup such as Redis.
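The core of such an SDK feature is retrying with exponential backoff and, if all attempts fail, reading from a local backup. A sketch of the pattern (function names and the fallback interface are illustrative, not our SDK’s actual API):

```python
import time
from typing import Any, Callable, Optional

def get_with_retries(fetch: Callable[[str], Any], key: str,
                     fallback_get: Optional[Callable[[str], Any]] = None,
                     attempts: int = 3, base_delay: float = 0.5) -> Any:
    """Fetch `key` from the API, retrying transient failures with
    exponential backoff (0.5s, 1s, 2s, ...). If every attempt fails,
    fall back to a local copy, e.g. one mirrored into Redis."""
    for attempt in range(attempts):
        try:
            return fetch(key)
        except OSError:                 # network-level failure: worth retrying
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    if fallback_get is not None:
        return fallback_get(key)        # serve the last known good copy
    raise RuntimeError(f"all {attempts} attempts failed for {key!r}")
```

Jitter (randomizing the delay) is usually added on top so that many clients retrying at once don’t hammer a recovering server in lockstep.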
Without realizing it, many of us are building single points of failure into our stack. To build resilient, fault-tolerant systems, you have to consider all aspects of your stack and find ways to keep your service up even when one or more things fail.