When I started working on a start-up nine months ago, this statement didn’t satisfy me or my co-founders:
“There is no way to build a Slack bot more reliable than Slack itself.” So says common sense.
Our product is an incident management tool called Amixr.IO. It’s a Slack app and yes, it should be super reliable. We didn’t leave our SE jobs in Silicon Valley and London to give up when faced with our first technical challenge.
What’s a reliable web service? And what’s not? In terms of reliability, Slack is pretty good. They do a lot of good things. For example, they keep users posted about problems through a status page and Twitter.
Unfortunately, Slack has also experienced serious periods of downtime, and users remember that. That’s why users don’t expect an uptime-critical service, such as our tool, to rely on it.
We’ve found a way to achieve that goal. Over the last few months, our service has proved much more reliable than Slack, and we’ve even published an SLA because we’ve built something that customers can rely on. The six simple principles described below helped us get there.
Any Slack app, as well as any other bot for Facebook or Telegram, uses a web server. Slack is just an interface that allows users to interact with a system. That means we can draw a red line across all of our applications. Parts that are connected to Slack are in danger; other parts depend only on us.
Imagine you’re creating a small Slack app. Its purpose is to post a message about new cards on your Trello board. The bot could be a simple web server that will wait for a webhook from Trello in order to post a new message to Slack. It could be as simple as:
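from rest_framework.response import Response
from rest_framework.views import APIView
from slackclient import SlackClient


class TrelloForwarderAPIView(APIView):
    def get(self, request):
        # Called by the Trello webhook; forwards the event straight to Slack.
        sc = SlackClient("<bot_access_token>")
        sc.api_call(
            "chat.postMessage",
            channel="<channel_id>",
            text="New Cards in Trello detected",
        )
        return Response(status=200)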
Let’s imagine that sc.api_call raises an Exception because the Slack API doesn’t work. I’m sure it won’t last long, so there’s no need to panic. However, you won’t be able to publish a message about this particular Trello card. It will be missed, forever, and your business could be damaged.
Now, let’s draw a line between the Trello part and the Slack part and sandbox everything where Slack could cause data loss. To do that, we’ll receive a message from Trello, write it to a queue, and then only remove it from the queue when the message has been successfully posted.
from celery import shared_task
from rest_framework.response import Response
from rest_framework.views import APIView
from slackclient import SlackClient


# Retry on any exception, with exponential backoff and no retry limit.
@shared_task(autoretry_for=(Exception,), retry_backoff=True, max_retries=None)
def notify_slack():
    sc = SlackClient("<bot_access_token>")
    sc.api_call(
        "chat.postMessage",
        channel="<channel_id>",
        text="New Cards in Trello detected",
    )


class TrelloForwarderAPIView(APIView):
    def get(self, request):
        # Only enqueue the task; a Celery worker talks to Slack.
        notify_slack.apply_async()
        return Response(status=200)
We use Celery, RabbitMQ, and Django. Celery’s apply_async method publishes a task to RabbitMQ. In our case, it’s a highly available cluster. Celery workers pick up new tasks and keep retrying each one until it completes without raising an exception. Now, we can be sure that we won’t lose data if the Slack API causes exceptions on our side.
Slack is huge. It has millions of users online and uses multiple servers in regions across the world. If something goes wrong in one place, it doesn’t mean that everything is broken at another location. Even if 99% of servers aren’t able to process a request, there’s another 1% that can.
That’s exactly what happened to Slack in June when its servers weren’t able to process a small fraction of our requests. However, our server didn’t give up, and sent everything successfully. Just nine messages got stuck at one moment, but we’ve sent them successfully since then.
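To illustrate why retrying pays off when only some requests fail, here is a minimal sketch of retry with exponential backoff in plain Python. It is not our production code (Celery's retry_backoff option handles this for us); the function name send_with_retry and its parameters are hypothetical and chosen for illustration only.

```python
import random
import time


def send_with_retry(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                    sleep=time.sleep):
    """Call send() until it succeeds or the attempts run out.

    Returns the number of attempts used. Re-raises the last error when
    every attempt fails. All names here are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff: up to 1s, 2s, 4s, ... capped at max_delay,
            # with random jitter so retries do not arrive in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

A request that fails on one attempt may well succeed on the next, so a message is only lost if every attempt fails. With no retry limit and a durable queue, as in the Celery task, it is never dropped silently.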
We love it when Slack works, but we need to know when it doesn’t. That’s why we monitor Slack ourselves.
We check message delivery, verify API responses, and so forth. We proactively monitor Slack and sometimes detect issues, such as the one detailed in this message regarding the blocks API, before anyone else.
If you’re building your service on top of another service, then write scripts and automated tests that proactively monitor the third party. That’s the only way to be the first to learn about a problem.
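As a rough idea of what such a script could look like, here is a minimal sketch of a single health probe. It assumes Slack's public api.test method as the probe target; the helper names fetch_json and probe_slack are hypothetical. A real monitor would also post test messages and run on a schedule, which this sketch leaves out.

```python
import json
import time
import urllib.request

# Slack's api.test method; used here as an assumed probe target.
SLACK_API_TEST_URL = "https://slack.com/api/api.test"


def fetch_json(url, timeout=5):
    """Fetch a URL and decode the JSON body (performs a real HTTP request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def probe_slack(fetch=fetch_json, clock=time.monotonic):
    """Run one health check and return a small report dict."""
    started = clock()
    try:
        body = fetch(SLACK_API_TEST_URL)
        healthy = bool(body.get("ok"))
        error = None if healthy else body.get("error", "unknown_error")
    except Exception as exc:  # network errors, timeouts, malformed JSON
        healthy = False
        error = repr(exc)
    return {"healthy": healthy, "error": error, "latency_s": clock() - started}
```

The fetch argument is injectable so the probe itself can be tested without touching the network, for example probe_slack(fetch=lambda url: {"ok": True}).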
OK, so you’ve built an awesome Slack app that works even when Slack doesn’t. How can you be sure that everything is ready?
Try simulating Slack downtime yourself. It’s just a small example of a fascinating subject called chaos engineering (https://principlesofchaos.org/). Simulate a problem, even in production, then check your monitoring and your backup systems.
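The smallest version of this experiment fits in a test. The sketch below fakes an outage with unittest.mock and checks that no message is lost. FakeSlackClient, deliver_all, and simulate_outage are hypothetical names made up for this illustration, not part of any Slack library.

```python
from unittest import mock


class FakeSlackClient:
    """Stand-in for a Slack client; hypothetical, for illustration only."""

    def post_message(self, channel, text):
        return {"ok": True}


def deliver_all(client, queue):
    """Try to post every queued message; return the ones that failed."""
    still_pending = []
    for channel, text in queue:
        try:
            client.post_message(channel, text)
        except Exception:
            still_pending.append((channel, text))
    return still_pending


def simulate_outage():
    """Run deliver_all once during a fake outage and once after recovery."""
    client = FakeSlackClient()
    queue = [("#incidents", "New Cards in Trello detected")]
    # Chaos step: every call to post_message raises, as if Slack were down.
    outage = mock.patch.object(
        client, "post_message", side_effect=ConnectionError("simulated outage")
    )
    with outage:
        pending_during_outage = deliver_all(client, queue)
    # Outage over: the patch is removed and the original method is back.
    pending_after_recovery = deliver_all(client, pending_during_outage)
    return pending_during_outage, pending_after_recovery
```

If the message is still pending during the outage and delivered afterwards, the queueing logic survived the experiment.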
A huge part of this trick is known as “graceful degradation.” It’s the ability to deactivate non-critical parts if they don’t work as expected.
In the case of total Slack downtime, our app will switch to backup delivery channels, such as e-mail, phone, or another messaging service. Functionality is reduced, but the business value is still delivered.
We don’t let a critical situation evolve without qualified people watching the process carefully. Even if a critical situation is exactly what we expected and recently tested for, there’s still a good chance that something unexpected will happen.
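A minimal sketch of that kind of fallback, assuming each delivery channel is wrapped in a plain function that raises on failure. The notify function and the channel names in the usage note are hypothetical.

```python
def notify(message, channels):
    """Deliver message through the first channel that works.

    channels is an ordered list of (name, send) pairs, most preferred first.
    Returns the name of the channel that succeeded.
    """
    errors = {}
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:
            # Remember why this channel failed, then degrade to the next one.
            errors[name] = exc
    raise RuntimeError(f"all delivery channels failed: {errors}")
```

Called as notify("Incident opened", [("slack", send_slack), ("email", send_email), ("phone", call_phone)]), it only reaches for e-mail when Slack fails, and for the phone when both fail. The three sender functions are placeholders you would supply yourself.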
Sometimes, engineers consider how to fix problems temporarily while causing as little damage as possible. For instance, during the issue with the blocks API, we were able to find a workaround pretty quickly. We dealt with the issue before Slack fixed the problem on its side.
When it comes to outages, we’ve chosen a proactive strategy. We actively inform our users about outages before they notice anything suspicious. When we do that, they don’t consider our service unreliable. Instead, they’re confident that we know about the problem and are doing our best to fix it.
Fun fact: at the very beginning, we had no backup channel for notifying users about downtime. The first time it happened, I had to look for our users on social networks, write to them on Facebook, introduce myself, and make sure that we hadn’t let anyone down. It felt really creepy, but, in the end, people reacted positively and became regular users.
The six principles described above are universal. They can help you build a reliable service on top of any third-party platform, and they’ve already helped us with Slack.
Each product has different requirements, so it’s up to you which practices to implement.