The popular messaging service Slack recently experienced a global outage that impacted millions of users, the majority of whom were individuals working from home or learning remotely due to the Coronavirus pandemic. The outage lasted for an extended period, potentially impacting their service level agreements (SLAs) and damaging the brand, while coming hot on the heels of the announcement of their $27.7 billion acquisition by Salesforce. Outages like the one Slack experienced are increasingly common in a technology-focused society; so how do we avoid costly issues like this?

What causes an outage?

An outage (also known as downtime) is a period of time when a given service or system is unavailable, failing to perform its primary functionality. Completely removing the possibility of an outage is almost impossible; however, there are a number of ways in which the odds of an outage can be significantly reduced. Some of the things that can be done to try and avoid an incident are:

- Testing
- Horizontal Scaling
- Chaos Engineering
- Resiliency Improvements to the Software
- Improving Observability
- Improving Processes

What testing should be carried out?

Having high test coverage is only part of the story when it comes to testing an application's robustness. High unit test coverage can be deceiving, as it simply proves that the software functions as expected in isolation. Unit testing should be combined with other forms of testing, including integration testing, security testing, and performance testing (which should also cover subtypes such as load testing and stress testing). Including a wider array of testing types ensures that both the functional and non-functional factors that could impact the overall system are checked under the various conditions that could occur in production.

What is horizontal scaling?

Horizontal scaling improves the application's stability and robustness.
It simply means that each component in the system can increase its capacity by adding more instances. For example, a simple CRUD application that uses MySQL RDS in AWS might horizontally scale by adding additional RDS instances as read replicas in multiple availability zones (AZs).

What is chaos engineering?

Chaos engineering is a form of resiliency testing created by Netflix when they built their tool Chaos Monkey. It functions by turning off random components in their architecture to simulate outages for particular services. This aided Netflix in their migration to a cloud-hosted infrastructure by allowing them to identify potential pitfalls, particularly dependencies that, when removed, would have caused incidents in a production environment. Doing this regularly in a non-production environment allowed these incidents to occur in a controlled setting that does not impact customers, mitigating the potential impact on the business.

How do we improve resiliency?

There are a number of resiliency improvements that can be made to an application in order to reduce the chances of it causing or intensifying an outage. Some simple improvements include adding an exponential backoff retry strategy to calls to other services, building circuit breaker functionality into the system, and implementing queues between services in order to provide a smoother, more consistent throughput to the service.

Implementing queues also helps to decouple the components from each other, meaning that each part of the system is more independent and has fewer dependencies. These strategies can also help to reduce load and enhance the fault tolerance of the application, both of which result in a higher level of resiliency.

What about observability?

Successful deployment of a component is only part of the story. Many deployments appear successful at first before running into issues.
Such issues include excessive use of resources such as memory, an unexpected failure that occurs irregularly, or even security issues that are not apparent at first. This is why observability is incredibly important. Logs, metrics, and traces allow you to continually monitor your system, while alerts notify your technical team when a component begins to show signs of an underlying problem.

What processes should be in place?

Good processes should underpin the entire engineering practice and ensure that human error is avoided where possible and that any incidents that do occur are rectified swiftly. Code reviews should be carried out to catch as many potential issues as possible, whereas processes like on-call policies and incident post-mortems foster a culture of continually iterating and learning from mistakes when they do occur.

While these methods do not necessarily prevent outages, they are strong foundations for reducing both the odds of an incident occurring and the odds of it causing a total outage. These incidents are rare, but when they do occur they can prove costly to an organization, and preparation should be in place to minimize both these costs and the chances of such an occurrence.

Previously published at https://kylejones.io/how-to-reduce-the-chances-of-an-outage
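
As an illustration of the resiliency improvements described above, the following is a minimal Python sketch of an exponential backoff retry strategy and a simple circuit breaker. It is not taken from any particular library or from Slack's or Netflix's systems: the names `backoff_delays`, `retry_with_backoff`, `CircuitBreaker`, and `CircuitOpenError`, along with the default thresholds and delays, are assumptions chosen for the example. The retry helper also adds random jitter, a common refinement that the article does not mention.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so it can be tested without waiting
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down has elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delays(retries, base=0.5, cap=30.0):
    """Return the un-jittered delay before each retry: base * 2**attempt, capped."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def retry_with_backoff(func, retries=4, base=0.5, cap=30.0, sleep=time.sleep):
    """Call func, retrying on any exception with exponential backoff and jitter."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Sleep a random time in [0, delay] so clients do not retry in lockstep.
            sleep(random.uniform(0, delays[attempt]))
```

A caller might wrap an outbound request as `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` is a hypothetical function that calls another service. In a real system the retry would usually be limited to errors known to be transient, and a rejected call from an open circuit would typically not be retried at all.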

How to Reduce the Chances of an Outage
