
How to build a Slack App more reliable than Slack

by Matvey Kukuy, August 7th, 2019

Too Long; Didn't Read

Amixr.IO is an incident management tool delivered as a Slack app, which means it has to stay up even when Slack doesn't. Slack has experienced serious periods of downtime, but "broken" doesn't always mean that everything is broken everywhere. Six simple principles helped us build a service more reliable than Slack itself, reliable enough that we've even unveiled an SLA our customers can rely on.

When I started working on a start-up nine months ago, this statement didn’t satisfy me or my co-founders:

“There is no way to build a Slack bot more reliable than Slack itself.” — common sense.

Our product is an incident management tool called Amixr.IO. It’s a Slack app, and yes, it has to be super reliable. We didn’t leave our software engineering jobs in Silicon Valley and London to give up when faced with our first technical challenge.

What’s a reliable web service? And what’s not? In terms of reliability, Slack is pretty good. They do a lot of things right. For example, they keep users posted about problems through their status page and Twitter.

Unfortunately, Slack has also experienced serious periods of downtime, and users remember that. That’s why users don’t expect an uptime-critical service such as ours to rely on it.

We’ve found a way to achieve that goal. Over the last few months, our service has proved much more reliable than Slack, and we’ve even unveiled an SLA because we’ve built something that customers can rely on. The six simple principles described below got us there.

1. Draw a red line

Any Slack app, as well as any other bot for Facebook or Telegram, uses a web server. Slack is just an interface that allows users to interact with a system. That means we can draw a red line across all of our applications. Parts that are connected to Slack are in danger; other parts depend only on us.

Imagine you’re creating a small Slack app. Its purpose is to post a message about new cards on your Trello board. The bot could be a simple web server that will wait for a webhook from Trello in order to post a new message to Slack. It could be as simple as:

from rest_framework.response import Response
from rest_framework.views import APIView
from slackclient import SlackClient


class TrelloForwarderAPIView(APIView):

    def get(self, request):
        # Post to Slack directly from the request handler: if Slack is down,
        # this call fails and the Trello event is lost.
        sc = SlackClient(<bot_access_token>)
        sc.api_call(
            "chat.postMessage",
            channel=<channel_id>,
            text="New Cards in Trello detected",
        )
        return Response()

Let’s imagine that sc.api_call raises an Exception because the Slack API doesn’t work. I’m sure it won’t last long, so there’s no need to panic. However, you won’t be able to publish a message about this particular Trello card. It will be lost forever, and your business could be damaged.

Now, let’s draw a line between the Trello part and the Slack part and sandbox everything where Slack could cause data loss. To do that, we’ll receive a message from Trello, write it to a queue, and then only remove it from the queue when the message has been successfully posted.

from celery import shared_task
from rest_framework.response import Response
from rest_framework.views import APIView
from slackclient import SlackClient


@shared_task(autoretry_for=(Exception,), retry_backoff=True, max_retries=None)
def notify_slack():
    # Runs on a Celery worker; any exception re-queues the task with
    # exponential backoff until the message is finally delivered.
    sc = SlackClient(<bot_access_token>)
    sc.api_call(
        "chat.postMessage",
        channel=<channel_id>,
        text="New Cards in Trello detected",
    )


class TrelloForwarderAPIView(APIView):

    def get(self, request):
        # Only enqueue the task; the webhook is acknowledged
        # even if Slack is down at this very moment.
        notify_slack.apply_async()
        return Response()

We use Celery, RabbitMQ, and Django. Celery’s apply_async method publishes a task to RabbitMQ; in our case, it’s a highly available cluster. Celery workers watch for new tasks and keep executing them until they succeed without raising an exception. Now we can be sure that we won’t lose data if the Slack API causes exceptions on our side.
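For context, the Celery side of this needs very little wiring. Here is a minimal sketch of what that setup might look like; the module name, broker URL, and settings are illustrative placeholders, not our production configuration.

# celery_app.py -- a minimal, illustrative Celery setup.
from celery import Celery

app = Celery(
    "trello_forwarder",
    # An amqp:// URL points Celery at RabbitMQ; in production this would be
    # the address of a highly available RabbitMQ cluster.
    broker="amqp://guest:guest@localhost:5672//",
)

# Acknowledge tasks only after they finish, so a crashed worker
# doesn't silently drop a queued notification.
app.conf.task_acks_late = True

A worker started with "celery -A celery_app worker" then picks tasks off the queue and runs them.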

2. “Broken” doesn’t always mean that “everything is broken”

Slack is huge. It has millions of users online and uses multiple servers in regions across the world. If something goes wrong in one place, it doesn’t mean that everything is broken at another location. Even if 99% of servers aren’t able to process a request, there’s another 1% that can.

That’s exactly what happened to Slack in June, when its servers failed to process a small fraction of our requests. Our server didn’t give up and kept retrying. Just nine messages got stuck at one moment, and all of them were delivered in the end.
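This only works if a failed delivery actually surfaces as an error, though. With the v1 slackclient, many API-level failures come back as a payload with "ok": false rather than as an exception, so a slightly hardened version of the task above (a sketch, with the same placeholder token and channel) raises in that case too, which re-queues the message until a healthy server accepts it.

from celery import shared_task
from slackclient import SlackClient


@shared_task(autoretry_for=(Exception,), retry_backoff=True, max_retries=None)
def notify_slack():
    sc = SlackClient(<bot_access_token>)
    response = sc.api_call(
        "chat.postMessage",
        channel=<channel_id>,
        text="New Cards in Trello detected",
    )
    # The v1 client reports many API errors in the payload instead of raising,
    # so turn them into exceptions to trigger Celery's retry with backoff.
    if not response.get("ok"):
        raise RuntimeError(f"Slack rejected the message: {response.get('error')}")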

3. Be notified early

We love it when Slack works, but we need to know when it doesn’t. That’s why we monitor Slack ourselves.

We check message delivery, verify API responses, and so forth. We proactively monitor Slack and sometimes detect issues, such as the one detailed in this message regarding the blocks API, before anyone else.

If you’re building your service on top of another service, then write scripts and automated tests that proactively monitor the third party. That’s the only way to be the first to learn about a problem.
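As an illustration, such a check can be as small as a heartbeat script. This is only a sketch: the canary channel, the alert() helper, and the one-minute interval are made up.

# slack_canary.py -- a minimal sketch of proactive Slack monitoring.
import time

from slackclient import SlackClient


def alert(message):
    # Placeholder: in a real setup this would page an engineer through a
    # channel that does not depend on Slack (e-mail, SMS, phone call).
    print(f"ALERT: {message}")


def check_slack_once(sc):
    # api.test is a lightweight "is the API up?" endpoint.
    if not sc.api_call("api.test").get("ok"):
        return False
    # Posting to a dedicated canary channel verifies real message delivery.
    response = sc.api_call(
        "chat.postMessage",
        channel="#slack-canary",
        text="heartbeat",
    )
    return bool(response.get("ok"))


if __name__ == "__main__":
    sc = SlackClient(<bot_access_token>)
    while True:
        if not check_slack_once(sc):
            alert("Slack check failed")
        time.sleep(60)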

4. Prepare a Plan B

OK, so you’ve built an awesome Slack app that works even when Slack doesn’t. How can you be sure that everything is ready?

Try simulating Slack downtime yourself. It’s a small example of a fascinating discipline called chaos engineering (https://principlesofchaos.org/). Simulate a problem, even in production, then check your monitoring and your backup systems.
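One low-risk way to run such a drill is a "chaos switch" that makes every Slack call fail on demand, so you can flip it in a test environment (or, carefully, in production) and watch how the queue, the monitoring, and the back-up channels react. A sketch, where the environment variable name is made up:

import os

from slackclient import SlackClient


class ChaosSlackClient(SlackClient):
    """Drop-in client that simulates Slack downtime when the switch is on."""

    def api_call(self, method, timeout=None, **kwargs):
        # SIMULATE_SLACK_OUTAGE is a made-up switch used only for the drill.
        if os.environ.get("SIMULATE_SLACK_OUTAGE") == "1":
            raise ConnectionError("Simulated Slack outage (chaos drill)")
        return super().api_call(method, timeout=timeout, **kwargs)

Swap it in for SlackClient, set the variable, and verify that messages pile up in the queue instead of disappearing.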

A huge part of this trick is known as “graceful degradation.” It’s the ability to deactivate non-critical parts if they don’t work as expected.

In the case of total Slack downtime, our app switches to back-up delivery channels, such as e-mail, phone, or another messaging service. Functionality is reduced, but the business value is still delivered.
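Here is a rough sketch of what that fallback could look like inside the notification task. The e-mail addresses, the retry limit, and the use of Django's send_mail as the back-up channel are illustrative choices, not our actual setup.

from celery import shared_task
from celery.exceptions import MaxRetriesExceededError
from django.core.mail import send_mail
from slackclient import SlackClient

MAX_SLACK_RETRIES = 10  # made-up threshold before degrading to e-mail


@shared_task(bind=True, max_retries=MAX_SLACK_RETRIES)
def notify_with_fallback(self):
    try:
        sc = SlackClient(<bot_access_token>)
        response = sc.api_call(
            "chat.postMessage",
            channel=<channel_id>,
            text="New Cards in Trello detected",
        )
        if not response.get("ok"):
            raise RuntimeError(response.get("error"))
    except Exception:
        try:
            # Keep retrying once a minute while Slack is struggling.
            raise self.retry(countdown=60)
        except MaxRetriesExceededError:
            # Graceful degradation: fewer features, but the message
            # still reaches a human through a back-up channel.
            send_mail(
                "New Cards in Trello detected",
                "Slack delivery failed; falling back to e-mail.",
                "bot@example.com",        # made-up sender
                ["oncall@example.com"],   # made-up recipient
            )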

5. People should be close

We don’t let a critical situation evolve without qualified people watching the process carefully. Even if a critical situation is exactly what we expected and recently tested for, there’s still a big chance of something unexpected occurring.

Sometimes engineers have to decide how to patch a problem temporarily while causing as little damage as possible. For instance, during the issue with the blocks API, we found a workaround pretty quickly and dealt with it before Slack fixed the problem on its side.

6. Talk with clients

When discussing outages, we’ve chosen a proactive strategy. We actively inform our users about outages before they notice anything suspicious. If that happens, they won’t consider our service unreliable — they’ll be confident that we know about the problem and are doing our best to fix it.

Fun fact: at the very beginning, we did not have reserve channels for notifying users about downtime. The first time it happened, I had to look for users on social networks, write to them on Facebook, introduce myself, and make sure that we didn’t let anyone down. It was really creepy, but, in the end, people reacted positively and became regular users.

The six principles described above are universal. They can help you build a reliable service on top of any third party, and they’ve already helped us with Slack.

Each product has different requirements, so it's up to you which practices to implement.