Quick — name one startup you know using a single-tenant architecture. Got one? Yeah, me neither.
Multi-tenant architectures are the standard way to run a startup these days. Create a database, provision some servers, add a load balancer, top it off with some caching and call it a day.
But why?
What is it about multi-tenant that’s superior to single-tenant? Is it cost? Complexity? Security? Scale?
Recently I started working on a new product and the multi-tenant architecture felt ungodly complex. There had to be a better solution — a quicker way to get up and running that wouldn’t create scaling issues for us later on.
Now, I’ll be honest. I’ve been developing software and running websites for 15 years and I’ve never had to deal with “internet scale” concerns. Hell, I’ve barely even had to deal with caching. I’ve been working on Kumu for the past five years and, outside of embeds, Kumu’s traffic can be handled by a single large server. It’s a powerful, sophisticated tool, and growth has been solid, but it’s simply not the kind of thing that goes viral.
We expect Compass to be a different story.
Compass helps you visualize your Slack team’s communication. Unlike Kumu — where you start small, build up your data incrementally, and handle most calculations locally in your browser — Compass taps into a firehose of data the second you sign up and most calculations are handled server-side.
Compass may never reach the scale we’re anticipating, but if it does we want to be prepared. And if it goes viral we don’t want to be woken up in the middle of the night to put out fires. We have lives, and wives, and kids, and hobbies, and many other things we’d much rather be doing. Like sleeping. (Trying to, at least.)
So instead of attempting to build a system that should scale, we’re putting in some time upfront to design an any-scale system: a system that’s built to remove scale from the equation (or at least make it somebody else’s responsibility).
With that goal, we’ve been exploring two potential architectures on top of AWS:
The remainder of this post explores the trade-offs we’ve considered between these two architectures. If there are any big ones we’ve missed, please mention them in the comments!
As a bootstrapped startup, cost is a big one. Get it wrong and it can cripple you before you get out of the gate. Get it right and you’re off sipping mai tais on the beach while you still look good in a bathing suit. (Or you could use that money productively to create jobs and grow the product. You’re the boss.)
With multi-tenant architectures, the cost to run the system is largely fixed. You’re paying a lot up front, but the good news is that each new customer you add drives down the average cost per customer. Outside of customer support, adding a new customer costs you next to nothing.
With single-tenant architectures, the marginal cost of adding new customers never goes down. Each new customer requires a new instance and each of those new instances has to be paid for. Worse than that, the cost per customer can actually go up, because larger teams need larger instances. While we might be able to support teams of 10 to 20 on a t2.nano, we’ll need a much larger instance to support teams with hundreds of members.
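To make the comparison concrete, here is a minimal sketch of the two cost shapes in Python. The function names and every dollar figure are placeholders I made up for illustration; they are not real AWS prices or Compass numbers.

```python
from typing import Sequence


def multi_tenant_cost_per_customer(fixed_monthly_cost: float, customers: int) -> float:
    """Shared infrastructure: one roughly fixed bill split across all customers."""
    if customers <= 0:
        raise ValueError("customers must be positive")
    return fixed_monthly_cost / customers


def single_tenant_total_cost(instance_costs: Sequence[float]) -> float:
    """Dedicated infrastructure: one instance bill per customer, summed."""
    return sum(instance_costs)


if __name__ == "__main__":
    shared_bill = 1000.0  # hypothetical monthly bill for the shared stack
    for n in (10, 100, 1000):
        # The per-customer share shrinks as customers are added.
        print(n, multi_tenant_cost_per_customer(shared_bill, n))

    small, large = 5.0, 80.0  # hypothetical per-instance monthly prices
    # Nine small teams plus one large team: the bill grows with every
    # customer, and the large team costs more than the small ones.
    print(single_tenant_total_cost([small] * 9 + [large]))
```

The sketch deliberately ignores the human cost of running each system, which is where the rest of this section is headed.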
So — since multi-tenant lowers cost per customer — it’s the clear winner here, right?
Well… no. Not really.
Besides the cost of the underlying infrastructure, there are also these things called humans. And they’re expensive.
The single-tenant architecture is a simpler architecture with fewer moving pieces. Simpler systems can be supported by smaller teams. And those teams can be made up of developers instead of dedicated sysadmins — it’s the exact same system you’re already running locally for development.
Which brings us to the next concern: parity.
Parity is the notion of similarity between environments. One of the major downsides of the multi-tenant architecture is the lack of parity between the environments we need to support:
Each one of those environments is complex on its own. Add them together and it’s clear why sysadmins get paid the big bucks.
You could argue that you don’t really need a staging environment. Because that’s what tests are for, right?
You could also argue that worrying about an on-premises enterprise version at this stage is premature. And in many cases you’d be right. But in Compass’s case, we’re juggling messages that contain sensitive information, and we’ve already had requests for an on-premises version. Since enterprise customers are typically your largest and most loyal customers, we’d be foolish not to factor them into our initial planning.
Single-tenant is the clear winner here since it gives you parity across all environments and an easy path to enterprise. As a small team with limited resources, we think that’s pretty sweet.
With multi-tenant, deploys are typically all or nothing. Maintenance on multi-tenant systems can be scary. You push out a single update, and every customer is immediately on the new system. If you botch it you take down the entire system. Been there, done that. Not fun when it happens.
With single-tenant, maintenance is incremental. If you botch it, you typically only take down a single team. Instead of deploying a single app update, you’re deploying N app updates. Instead of migrating one database, you’re migrating N databases. On the surface it sounds like this would create more work for you, but since the systems are isolated and identical, most of that work can be automated. All you need is a bit of tooling to orchestrate the updates.
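As a rough idea of what that tooling might look like, here is a minimal Python sketch of a rolling update that walks the instances one at a time and stops at the first failure. The names `rolling_update`, `update` and `is_healthy` are hypothetical; the two callables stand in for whatever actually deploys the app, migrates the database and checks the instance.

```python
from typing import Callable, Iterable, List, Tuple


def rolling_update(
    tenants: Iterable[str],
    update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Update tenants one at a time and stop at the first failure.

    Returns (updated, remaining) so that a halted run can be resumed
    later starting from the tenant that failed.
    """
    pending = list(tenants)
    updated: List[str] = []
    while pending:
        tenant = pending[0]
        try:
            update(tenant)            # e.g. deploy the app, migrate this tenant's database
            ok = is_healthy(tenant)   # e.g. hit a health-check endpoint
        except Exception:
            ok = False
        if not ok:
            break                     # halt here: later tenants are never touched
        updated.append(pending.pop(0))
    return updated, pending
```

Because the run halts at the first unhealthy instance, a botched release reaches one team instead of all of them, and the `remaining` list tells you where to pick up once the fix is in.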
A beautiful side effect of single-tenant maintenance is that you get incremental rollouts and targeted beta releases for free. No need to mess around with load balancers or juggle internal feature flags.
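One way to picture the rollout side of this is as a plan of "waves": beta teams first, then everyone else in small batches. The sketch below is an assumption about how such a plan could be built; `plan_waves`, the team names and the wave size are all made up for illustration.

```python
from typing import Iterable, List, Set


def plan_waves(tenants: Iterable[str], beta: Set[str], wave_size: int) -> List[List[str]]:
    """Put beta tenants in the first wave, then split the rest into fixed-size waves."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    ordered = list(tenants)
    first = [t for t in ordered if t in beta]
    rest = [t for t in ordered if t not in beta]
    waves: List[List[str]] = [first] if first else []
    for i in range(0, len(rest), wave_size):
        waves.append(rest[i:i + wave_size])
    return waves


if __name__ == "__main__":
    teams = ["team-a", "team-b", "team-c", "team-d", "team-e"]
    for wave in plan_waves(teams, beta={"team-c"}, wave_size=2):
        print(wave)
```

Each wave is just a list of instances to update, so the release goes out to the beta teams, gets a look, and only then moves on.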
Both architectures offer their own form of resilience.
Multi-tenant creates resilience at the team level. Each team is serviced by multiple instances, spread across multiple regions/zones, and hosted behind load balancers. A team is unlikely to experience issues unless there is a system-wide outage.
Single-tenant creates resilience at the system level. Outside of DDoS attacks at the DNS level there are very few ways to take down the entire system. A team may experience problems, but it’s unlikely those problems will extend beyond that single instance.
To account for disasters, we can throw in EBS backups, health checks, and a recovery instance that can stuff Slack events into a queue until a new server is provisioned. Now we’ve got a simple, resilient system at both the team and system level.
Team and system resilience? Yesssss
Plus, in general, a single angry customer is much easier to deal with than an angry mob. So chalk another one up for single-tenant here.
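The recovery instance mentioned above, the one that holds Slack events until a replacement server is up, could be sketched roughly like this. It is an in-memory toy: `RecoveryBuffer` and its methods are hypothetical names, events are plain dicts, and a real version would need a durable queue so the buffer itself survives a restart.

```python
from collections import deque
from typing import Callable, Deque, Dict


class RecoveryBuffer:
    """Holds incoming events for a team while that team's instance is down."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[dict]] = {}

    def enqueue(self, team_id: str, event: dict) -> None:
        """Store an event for later delivery, preserving arrival order."""
        self._queues.setdefault(team_id, deque()).append(event)

    def replay(self, team_id: str, deliver: Callable[[dict], None]) -> int:
        """Deliver buffered events in arrival order and return how many were sent.

        An event is removed only after deliver() returns, so an exception
        part-way through leaves the undelivered events in the buffer.
        """
        queue = self._queues.get(team_id, deque())
        sent = 0
        while queue:
            deliver(queue[0])
            queue.popleft()
            sent += 1
        return sent
```

Once the new instance passes its health check, a single `replay` call drains the backlog into it in the order the events arrived.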
On Kumu, every project is backed by a separate CouchDB database. Over the years we’ve found this isolation extremely valuable. Sometimes we mess things up. Sometimes the customer messes things up. Regardless of who’s to blame, disaster recovery is much easier when each customer’s data is physically isolated, rather than simply being logically isolated within a single database. Database restorations become simple filesystem copies instead of fragile, complex database queries.
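Here is roughly what "restore as a filesystem copy" can look like in Python. The one-directory-per-tenant layout and the `restore_tenant` name are assumptions for the sake of the example, not a description of how Kumu or CouchDB actually lays out files, and the sketch assumes the database process for that tenant is stopped while the copy runs.

```python
import shutil
from pathlib import Path


def restore_tenant(backup_root: Path, data_root: Path, tenant: str) -> Path:
    """Replace one tenant's data directory with its backup copy.

    Only the named tenant's directory is touched; every other tenant's
    data under data_root is left alone.
    """
    source = backup_root / tenant
    target = data_root / tenant
    if not source.is_dir():
        raise FileNotFoundError(f"no backup for {tenant!r} at {source}")
    if target.exists():
        shutil.rmtree(target)        # drop the damaged copy
    shutil.copytree(source, target)  # put the backup in its place
    return target
```

Compare that with restoring one customer out of a shared database, where you would have to pick their rows out of every table without disturbing anyone else’s.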
As far as I’m concerned, simple is the best kind of secure. Complex systems often give the illusion of security that isn’t truly there.
If everything is running locally on a single machine, and that machine is locked down with key-based SSH access, and the only other port that machine exposes is port 443 — then I’m not losing sleep at night worrying about security breaches.
Yes, the machine is exposed directly to the internet. But as long as any part of the system is exposed, direct exposure isn’t inherently less secure than indirect exposure. You can easily mess up either one.
If two systems have similar exposure and one is significantly simpler, I’ll go with the simpler system every time. Less surface area. Less complexity. Easier to audit. Sold.
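"Easier to audit" can be taken quite literally. With only SSH and 443 meant to be reachable, the whole exposure check fits in a few lines. This is a sketch under assumptions: I’m assuming SSH sits on its default port 22, and the list of open ports is something you would get from your own scan or firewall config, which isn’t shown here.

```python
from typing import Iterable, List, Set

# Assumption: SSH on its default port 22, plus HTTPS on 443.
ALLOWED_PORTS: Set[int] = {22, 443}


def unexpected_ports(open_ports: Iterable[int], allowed: Set[int] = ALLOWED_PORTS) -> List[int]:
    """Return the open ports that are not on the allowlist, sorted ascending."""
    return sorted(set(open_ports) - allowed)


if __name__ == "__main__":
    # Hypothetical scan result: 5984 should not be reachable from outside.
    print(unexpected_ports([22, 443, 5984]))
```

An empty list means the instance exposes nothing beyond what was intended; anything else is a finding.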
As with most things, there’s no holy grail here. Both architectures have their trade-offs and both are solid solutions for the right problems. I’ve always used multi-tenant architectures in the past, but in this case it just doesn’t feel like the right tool for the job.
At the end of the day it’s our job as engineers to find that sweet spot at the intersection between priorities and constraints. For Compass, that sweet spot appears to be single-tenant.
There’s a strong argument to be made for multi-tenant too, but for now, a single instance per team looks like the quickest way to get up and running while minimizing scaling concerns. It’s also important to note that single-tenant wouldn’t even be an option if we were hoping to allow cross-team analysis. That said, here are the key advantages single-tenant provides for Compass:
So that’s where we’re at. At this point Compass is just a prototype but we’re hoping to build out the backend over the next few weeks. If you run an active Slack team and you’re interested in being a beta tester, let me know! You can reach me at @rymohr on Twitter.
Do you have experience running single-tenant architectures at scale? If so I’d love to hear about it in the comments below or the related post on HN!