Note: If you’re a user of Stream, be sure to update your API client to the latest version for a large improvement in reliability. For those of you on a custom API client, have a look at our updated REST documentation.
Domain resolution is one of the backbone services of the Internet. It’s something we typically spend very little time thinking about. Of course, that changes when it breaks. Over the past year, .IO domain outages have been the number one reason our customers couldn’t use Stream. Specifically, the outage on September 20th, 2017 turned out to be a major headache. This article will go into the details behind the .IO domain name reliability issues and how we’re working around them.
The infrastructure of the Internet’s Domain Name System (DNS) is large and complex. Due to its decentralized nature, if the problem lies with your DNS provider or the broader DNS infrastructure, there is little you can do other than sit and wait for the problem to be resolved. The only practical way to deal with DNS outages is to fall back to a backup domain.
This makes DNS outages quite nasty. Many risks are complex and costly to mitigate and in some scenarios, it is virtually impossible to do so.
On September 20th, 2017, our system monitors and health checks started to show intermittent failures. Pings to our website and API servers were failing because “getstream.io” records could not be resolved to a valid IP address.
Domain name resolution is necessary to access our core API service and dashboard. Without it, clients are not able to find the address of our servers. It goes without saying that this was immediately triaged as critical and received the full attention of our team.
After an initial investigation, we discovered that resolving any getstream.io record would randomly fail with an incorrect NXDOMAIN error. Subsequently, one of our engineers identified that the resolution of .io domains would consistently fail on 2 of the 6 authoritative .io nameservers. The remaining four were operating correctly, which explained the apparently random nature of the errors.
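A per-nameserver check like the one described above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library, not necessarily the tooling used during the incident; the function names are invented for this example. It sends a single A-record query over UDP directly to each nameserver given on the command line and prints the response code. For a registered domain, a healthy TLD nameserver is expected to answer NOERROR (with a referral), whereas the faulty ones answered NXDOMAIN.

```python
import random
import socket
import struct
import sys

RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(hostname: str, query_id: int) -> bytes:
    """Build a minimal DNS query for the A record of `hostname`."""
    # Header: id, flags (0 = standard query, recursion not requested),
    # one question, no answer/authority/additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def response_rcode(response: bytes) -> str:
    """Return the name of the response code (low 4 bits of the flags field)."""
    if len(response) < 12:  # shorter than a DNS header
        return "MALFORMED"
    (flags,) = struct.unpack("!H", response[2:4])
    rcode = flags & 0x000F
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}")


def check_nameserver(nameserver: str, hostname: str, timeout: float = 3.0) -> str:
    """Send one query straight to `nameserver` and report what came back."""
    query = build_query(hostname, random.randint(0, 0xFFFF))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (nameserver, 53))
            response, _ = sock.recvfrom(512)
        except OSError as exc:  # timeouts and address lookup failures
            return f"ERROR ({exc})"
    return response_rcode(response)


if __name__ == "__main__":
    # Usage: python check_ns.py <hostname> <nameserver> [<nameserver> ...]
    hostname, *nameservers = sys.argv[1:] or ["getstream.io"]
    for ns in nameservers:
        print(ns, check_nameserver(ns, hostname))
```

Querying each authoritative server individually matters here because a recursive resolver picks one of them per lookup, which hides a partially broken set behind seemingly random failures.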
A bad one looks like the following:
Since this happened on the authoritative nameservers, we reached out to our DNS provider and then tried to get in touch with NIC.io as well. To our surprise, we found out that NIC.io could only be reached by phone between 7 AM and 12 AM UTC, Monday through Friday, and did not publish any status information about the health of the service.
In the meantime, we started looking at who else was affected by this outage and posted about it on Twitter and Hacker News. While waiting for the outage to end, we also increased our DNS TTL so that the number of DNS queries would be as low as possible. Shortly after that, we received a reply from Gandi.net informing us that NIC.io was fixing the problem.
The outage lasted for almost 2 hours, during which 1/5th of DNS queries for any .getstream.io record would fail. For something that sits in front of our service, this is a huge problem, and it raised more than a few questions on our end.
We get it. Sometimes things break. Realistically, a similar outage could have hit any top-level domain.
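To see why a longer TTL reduces exposure, consider how often a single caching resolver has to look a record up again during the outage window. The sketch below is back-of-the-envelope arithmetic only; the TTL values are examples chosen for illustration, not Stream’s actual settings, and the function name is made up.

```python
import math


def max_refreshes(window_seconds: int, ttl_seconds: int) -> int:
    """Upper bound on re-lookups of one record by one caching resolver."""
    return math.ceil(window_seconds / ttl_seconds)


outage_window = 2 * 60 * 60  # the outage lasted almost two hours

for ttl in (60, 300, 3600):  # example TTLs in seconds (assumed values)
    lookups = max_refreshes(outage_window, ttl)
    print(f"TTL {ttl:>4}s: at most {lookups} lookups per resolver")
```

Every lookup that is avoided is one less chance of landing on a broken nameserver. Note that a raised TTL only helps a resolver after it has fetched the record with the new value.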
Back when we started in 2014 we decided that .io was great from a branding perspective. Stream is a technical product and our audience is mainly technical, so .io seemed like a great match. Using the same domain for the APIs was more of a consequence than a thoughtful decision.
It is impossible to estimate the likelihood of .com nameservers having the same kind of outage as the .io nameservers. One thing that surprised us was that while about 20% of DNS resolutions for all .io domains were totally broken, it was hard to find people complaining about it on Twitter. In fact, I believe we were one of the first to tweet about it. Had this happened to all .com domains, every news source would have been on fire.
Unfortunately, we found out the hard way that NIC.IO isn’t equipped with the technical support and systems necessary to manage a top-level domain. Being unable to reach them while a major outage was happening is unacceptable.
Looking further, it does not take a lot of research to find out that the .io TLD team made several mistakes over the past few years. Just to name a few:
Searching for .io on HN returns a long list of similar outages.
Adding a .com domain and using it as the default on all our API clients is clearly the low-hanging fruit. Of course, we could have the same problem if .com had an outage. However, we are vastly more confident in the management behind .com. We expect that such an issue would be identified earlier and that it would not take hours for it to be acknowledged and remedied.
These DNS issues caused us to pause and think about all the ways in which DNS can break.
Since we control the API clients, implementing a failover mechanism on the client side is easy. Setting up and maintaining a backup domain and/or a backup DNS provider, on the other hand, can be very challenging. A backup domain means keeping hundreds of DNS records in sync and doubling our SSL certificates. A backup DNS provider means keeping all DNS records in sync across two different providers and changing our infrastructure so that it does not rely on any Route53-specific feature. As an AWS customer, this is a major challenge, since DNS is deeply integrated into AWS in many ways.
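As an illustration of how small the client-side part can be, here is a minimal sketch of domain failover. It is not the code shipped in Stream’s API clients; the hostnames are placeholders and `pick_reachable_host` is a name invented for this example. The resolver is passed in as a parameter so the fallback logic can be exercised without touching the network.

```python
import socket
from typing import Callable, Optional, Sequence

# Placeholder hostnames for illustration only, not Stream's real endpoints.
API_HOSTS = ("api.example.com", "api.example.org")


def pick_reachable_host(
    hosts: Sequence[str] = API_HOSTS,
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> str:
    """Return the first host whose name resolves, trying hosts in order."""
    last_error: Optional[Exception] = None
    for host in hosts:
        try:
            resolve(host)  # raises an OSError subclass when resolution fails
            return host
        except OSError as exc:
            last_error = exc  # remember the failure and try the next domain
    raise ConnectionError(f"none of {list(hosts)} could be resolved") from last_error
```

A client would call `pick_reachable_host()` before opening a connection and build its base URL from the result. Putting the backup on a different TLD is what makes this useful: a second hostname under the same TLD would fail together with the first.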
Looking forward, our plan is to add a .org domain and find a DNS provider to manage the nameservers.
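Running a second domain or provider means the record sync mentioned earlier has to be checked continuously. The sketch below shows one way such a check could look, assuming the records of both sides have already been exported into plain dictionaries. The data shape and the `diff_records` helper are assumptions made for this example; fetching the records from a real provider is left out.

```python
from typing import Dict, FrozenSet, List, Tuple

# Maps (record name, record type) to the set of values for that record.
RecordSet = Dict[Tuple[str, str], FrozenSet[str]]


def diff_records(primary: RecordSet, secondary: RecordSet) -> List[str]:
    """List every record that is missing or different between the two sides."""
    problems = []
    for key in sorted(set(primary) | set(secondary)):
        name, rtype = key
        if key not in secondary:
            problems.append(f"{name} {rtype}: missing on secondary")
        elif key not in primary:
            problems.append(f"{name} {rtype}: missing on primary")
        elif primary[key] != secondary[key]:
            problems.append(f"{name} {rtype}: values differ")
    return problems


# Example data with documentation-only names and addresses.
primary = {
    ("api.example.com.", "A"): frozenset({"192.0.2.10", "192.0.2.11"}),
    ("www.example.com.", "CNAME"): frozenset({"lb.example.com."}),
}
secondary = {
    ("api.example.com.", "A"): frozenset({"192.0.2.10"}),
}

for problem in diff_records(primary, secondary):
    print(problem)
```

An empty result means both sides agree. Provider-specific record types have no equivalent on the other side, which is one reason the setup has to avoid vendor-specific features in the first place.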
In hindsight, using a .IO domain for our core APIs was not a great choice. The outage on September 20th showed how severe the reliability problems are and how weak the support infrastructure behind the TLD is. Based on our experience, we would advise against using a .IO domain name if availability is important.
To work around the DNS issue, Stream’s API traffic now runs on a .com domain name. The site still runs on .io since this is harder to change and not as critical in terms of uptime. To further improve reliability we’re considering:
DNS as a whole is one of those things that most people take for granted, but it can easily cause serious downtime and trouble. Using a widely used TLD like .com, .net or .org is the best and easiest way to improve reliability.
This is a collaboration from the team at GetStream.io, led by Tommaso Barbugli, CTO at GetStream.io. The original blog post can be found at https://getstream.io/blog/stop-using-io-domain-names-for-production-traffic/.