Your CTO recently handed you a project to build or revamp your product’s notification system. You realized the complexity of the project around the same time you discovered that there isn’t much information online about how to do it.
Companies like LinkedIn, Uber, and Slack have large teams of over 25 employees working just on notifications, but smaller companies like yours don’t have that luxury. So how can you meet the same level of quality with a team of one? This is the second post in our series on how you, a developer, can build or improve the best notification system for your company. In this piece, we will learn about scalability and reliability.
The modern web-based application relies on notifications as a way of connecting a product with its users. Notification types include push, SMS, email, and direct messages. There are many helpful tools for building a notification system, but it’s no easy task, especially when reliability and scalability have to be taken into account.
For a company to grow, it will eventually need to decide between the cost of building and maintaining its own system or opting for the functionality and proven reliability of a third-party product. This is known as the classic build-vs-buy decision.
While the cost of purchasing a solution may be clear, the cost of building your own can be difficult to calculate. In this guide, we cover building a scalable and reliable notification system in detail to give you an idea of the required effort.
Scalability and reliability are two distinct yet interrelated aspects at the core of a good notification system. You achieve reliability when your customer receives all of your notifications without errors or duplicates. This means reaching your customer consistently and on time. Scalability is where your application can handle higher notification volumes as a result of your product’s growth.
It costs time and money to improve a notification system’s reliability. If you’re still looking for product-market fit, it might not make financial sense to prioritize reliability when your resources might be better allocated elsewhere.
Once you find product-market fit and grow your user base, however, your notification volume will increase quickly. If you’re growing fast, you might choose to invest more into fixing other critical parts of your SaaS application instead of improving your notification system’s reliability. But you might jeopardize your product’s growth if your customers don’t receive your notifications due to errors, timeouts, or delays. If a problem-ridden notification system starts impacting the user experience, it’s just as likely to impact your bottom line.
Scalability and reliability are both key considerations for any build-vs-buy decision. For example, when the feature management platform LaunchDarkly was making its own build-vs-buy decision, it had to consider its SLAs, SLOs, and SLIs as part of its investment in a notification system. It had recently closed its Series D funding, and substantial volume and load, compliance, reliability, and stability were all key factors in the decision-making process. LaunchDarkly decided to go with Courier because the platform met LaunchDarkly’s strict scalability and security requirements, provided the necessary features, and fit seamlessly into LaunchDarkly’s tech stack.
Scalability and reliability are two different aspects. But both become concerns if you want your company to keep up with a growing customer base. If you lack one or the other, you’ll likely meet problems along the way.
If your notifications lack reliability, your brand stands to lose. For a product like Slack, a delayed push notification has little utility; timeliness is crucial to creating a real-time conversation between team members.
But even if you’re not Slack, losing early users’ trust in your notification system can slow growth. Your early adopters are unlikely to recommend your product if they don’t trust the way it works. Duplicate notifications are another turnoff: frequently receiving duplicates suggests the product isn’t stable enough to rely on, so early adopters might hesitate to share it with their friends or colleagues.
So, what is the most common source of scalability and reliability issues in a notification system?
Based on Courier’s experience, it’s the fact that notifications are rarely spread out evenly over time. The reality of unpredictable volume spikes requires an understanding of how to scale infrastructure to handle high volumes at a reasonable cost. If a system doesn’t scale well to accommodate peaks, notifications will end up being processed and delivered beyond their relevance window. In the worst-case scenario, an overwhelmed message queue can result in a system outage. In short, if you find your service growing and your notification system is not equipped to handle it, you are taking on considerable risk.
Moreover, a notification application needs to be kept available to its users while its code is replaced. If you designed your application without keeping that in mind, you face the possibility of extended downtime while you work on it. Downtime means your users won’t receive notifications and won’t be engaged with your product. Ultimately, designing your notification system to reduce downtime saves you both time and money.
A system built without both scalability and reliability in its design patterns also risks frustrating and overworking your engineering team. Engineers on call risk getting burnt out if they have to constantly respond to alerts in the notification system. In addition, if the engineering team needs to repeatedly attend to notification issues, they might miss valuable product priorities like adding new features, improving user experience, and creating integrations.
To build a good notification system, you need to know how to measure its reliability. Read on below.
Site reliability engineering is a discipline for managing the operation of large software systems. Its main tools are SLIs (Service-Level Indicators), SLOs (Service-Level Objectives), and SLAs (Service-Level Agreements). Together, these define how a service’s behavior is measured, the targets the provider aims for, and the commitments made to users, including the consequences if certain provisions are not met.
The key component of any reliability measurement is the way your customers perceive your product. What levels of latency in your API do your users associate with an application that’s running smoothly? How long would a customer wait for the user interface to load before deciding that it’s broken? How soon should asynchronous jobs be completed so that your customers can proceed with their day? SLAs, SLOs, and SLIs are tools to represent numeric answers to questions like these.
An SLI is a metric that measures how a service is actually behaving for the user. A service-level indicator could be the speed of a database operation or the size of a notification queue. These are the actual metrics that you would view in a tool like AWS CloudWatch or Datadog. The SLI, as the underlying measurement, forms the basis for an SLO.
A service-level objective is the target that you as a provider aim to attain. An example would be a specific latency for a notification endpoint, including the latencies of underlying middleware, queues, or databases. Here, you’ll especially need to understand which metrics actually matter to the customer and tailor your objectives in that direction.
The final layer is the SLA. Your service-level agreement is a legally binding contract with your users, based on your SLOs and the metrics provided by your SLIs. SLAs typically reflect the targets defined at the SLO layer; an example would be an endpoint being available and responding within 1 second 99.9% of the time. If your product falls short of the target, a customer might have the right to request a refund for your service, so SLAs tie service objectives to direct financial losses when those objectives aren’t met.
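To put the 99.9% example in concrete terms, it leaves an error budget of roughly 0.001 × 30 × 24 × 60 ≈ 43 minutes of downtime per 30-day month, which is all the breathing room your team has before the agreement is breached.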
These components all work together to provide a specific range of metrics within which your product is operating correctly. Paying close attention to SLIs and SLOs, which should be tailored to the customer, can help identify problems before your customers do. Things will go wrong, but how you respond to each situation will make a big difference.
Notifications will be a fundamental part of your functionality. An example of an SLI could be the size of the notification queue, and an example of an SLO could be a target on the latency of processing a notification from creation until it’s sent to the user.
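As a minimal sketch of what tracking such an SLI might look like, the snippet below publishes queue depth as a custom CloudWatch metric using the AWS SDK for JavaScript (v2); the namespace and metric name are illustrative, not part of any standard.

```typescript
import { CloudWatch } from "aws-sdk";

const cloudwatch = new CloudWatch();

// Publish the current notification-queue depth as a custom metric so it can
// back an SLI dashboard or alarm. "Notifications" and "QueueDepth" are
// illustrative names chosen for this sketch.
async function reportQueueDepth(queueDepth: number): Promise<void> {
  await cloudwatch
    .putMetricData({
      Namespace: "Notifications",
      MetricData: [
        {
          MetricName: "QueueDepth",
          Value: queueDepth,
          Unit: "Count",
          Timestamp: new Date(),
        },
      ],
    })
    .promise();
}
```

With the metric in place, you can define an SLO against it (for example, an alarm when queue depth stays above a threshold for several minutes) and watch it alongside your other dashboards.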
While most companies will not cover their notifications under an SLA, it still might be necessary for certain circumstances. For example, a B2B CRM application where notifications need to be used as reminders of upcoming client calls will probably include notification-related standards as part of an SLA. If your product requires coverage of notifications under an SLA, take care to ensure that your product metrics and objectives are aligned with your agreements to avoid overpromising and consequent legal issues.
Relying on a provider API like Mailgun or SendGrid for sending emails, or on a push notification service like Firebase Cloud Messaging for iOS and Android notifications, can feel like a reliability concern. If you are on the fence about how using a third-party provider would impact your reliability metrics, read on below.
In considering the scope of building a notification solution, you might feel reluctant to add provider APIs into the mix and therefore focus on managing all notifications in-house. Instead of using a provider like SendGrid or Twilio, you might be considering setting up email or SMS infrastructure in-house.
But is using provider APIs a reliability concern?
Courier, for example, is an HTTP API. It is true that HTTP requests can fail due to connectivity issues, SSL errors, or unexpected delays; perhaps the customer doesn’t receive an HTTP response to their API request at all. You can make such failures less common by relying only on services that reside within your own network, but given the complexity of today’s networks, eliminating them completely isn’t possible.
In our experience, the answer is not to avoid APIs altogether but to build mechanisms that mitigate API request failures.
At Courier, we built mechanisms to avoid many HTTP API issues. For example, Courier uses idempotency keys to safely retry messages without duplicate sends to the customer. Integrating idempotency and other fault-tolerant processes is a vital part of building a reliable notification system.
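As a sketch of the idea (not Courier’s internal implementation), the snippet below retries an HTTP send while reusing a single idempotency key, so a retry after a lost response can’t produce a duplicate send. The endpoint URL is a placeholder, and it assumes the provider honors an Idempotency-Key header, as many HTTP APIs do.

```typescript
import { randomUUID } from "crypto";

// Retry an HTTP send while reusing one idempotency key so the server can
// recognize and deduplicate repeated attempts of the same logical request.
// The URL and "Idempotency-Key" header are illustrative assumptions.
async function sendWithRetries(payload: unknown, maxAttempts = 3): Promise<Response> {
  const idempotencyKey = randomUUID(); // generated once per logical message

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const response = await fetch("https://api.example.com/send", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Idempotency-Key": idempotencyKey, // same key on every retry
          Authorization: `Bearer ${process.env.API_TOKEN}`,
        },
        body: JSON.stringify(payload),
      });
      if (response.ok) return response;
    } catch (err) {
      // Network error or lost response: safe to retry because the key is reused.
    }
    // Simple linear backoff between attempts.
    await new Promise((resolve) => setTimeout(resolve, attempt * 500));
  }
  throw new Error("send failed after retries");
}
```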
Now that we’ve covered the core concepts, let’s discuss specific suggestions for building a scalable and reliable notification system for AWS users.
If you’re using AWS, there are many tools to help you build a scalable notification system. DynamoDB and AWS Lambda are two of the AWS services we use at Courier; applications built on them scale easily, are cost-effective to run, and require little to no upkeep.
Still, you should take care to avoid performance bottlenecks even when using services like Lambda and DynamoDB. Below we’ll share some tips based on our experience using AWS services.
How you build for scalability depends on the tools you choose and on how they access your data. A system is scalable when it can still perform within its service-level objectives even as volume increases. Whichever tools you decide to use, plan for a notification system that can handle sudden and drastic increases in data volume.
When creating your own notifications application, it’s crucial to pay attention to access design patterns from the outset. You’ll need to understand how you’ll be accessing data before you start building the application. It might not be complicated to build a simple notifications application into your product, but problems don’t typically become apparent until later in the implementation or as you’re trying to scale.
If you’re using DynamoDB, a common problem with access patterns is partition key structure. DynamoDB uses primary keys that consist of two components: a partition key and a sort key. DynamoDB uses the partition key of a table to distribute the table’s data across partitions. The more evenly the table’s records are distributed, the higher the overall throughput of the table will be.
To determine the partition a record needs to be written to, DynamoDB runs its hashing function on the record’s partition key. Based on the hashing function’s output, an item is mapped to a specific physical location in the DynamoDB system. Each DynamoDB partition has a limited amount of throughput capacity. If one of your table’s underlying partitions were to receive more reads or writes than your other partitions, the throughput of your DynamoDB table would be lower than if the load were evenly distributed. Overloading one partition while underloading the others due to too many records having the same partition key is usually referred to as the hot key problem.
Ineffective load distribution between partitions in DynamoDB.
The partitions are managed by DynamoDB itself, so the only way for a developer to address an issue with record distribution between partitions is to change the structure of the partition key. A common solution to the hot key problem is to create a composite partition key. In our example above, the tenant_id column is used as a partition key, and this configuration causes a performance bottleneck on Partition 1 when working with records for the tenant tenant_1. To address the problem, we can create a composite partition key by combining the tenant_id and user_id attributes. See the impact of this change in the following illustration.
More effective load distribution between partitions through the use of composite partition keys.
In this example, the load is distributed more evenly because the records now have different partition keys.
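A minimal sketch of writing a record with such a composite key is shown below, using the AWS SDK for JavaScript (v2) DocumentClient; the table and attribute names ("notifications", "pk", "sk") are illustrative, not a prescribed schema.

```typescript
import { DynamoDB } from "aws-sdk";

const documentClient = new DynamoDB.DocumentClient();

// Write a notification record using a composite partition key so that one
// large tenant's records are spread across many partitions instead of one.
// Table and attribute names here are illustrative.
async function putNotification(
  tenantId: string,
  userId: string,
  notificationId: string,
  body: string
): Promise<void> {
  await documentClient
    .put({
      TableName: "notifications",
      Item: {
        pk: `${tenantId}#${userId}`, // composite partition key: tenant_id + user_id
        sk: notificationId,          // sort key keeps a user's items unique and ordered
        body,
        createdAt: new Date().toISOString(),
      },
    })
    .promise();
}
```

The trade-off is that queries now need both the tenant and the user to address a partition, so pick the composite that matches how you actually read the data.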
Similarly, if you’ll be using AWS S3 to store attachments you send with your notifications, pay attention to your access design patterns. Improperly designed S3 bucket and key structures can cause throttling and therefore impact the performance of your application.
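Because S3’s request-rate limits apply per key prefix, spreading attachment keys across many prefixes (for example, by tenant and date) avoids concentrating traffic under a single prefix. A sketch of one possible key layout follows; the bucket name and path conventions are illustrative.

```typescript
import { S3 } from "aws-sdk";

const s3 = new S3();

// Spread attachment objects across many key prefixes (per tenant and per day
// in this sketch) so request load isn't concentrated under a single prefix.
// The bucket name and key layout are illustrative.
async function uploadAttachment(
  tenantId: string,
  notificationId: string,
  fileName: string,
  body: Buffer
): Promise<string> {
  const datePrefix = new Date().toISOString().slice(0, 10); // e.g. "2022-05-01"
  const key = `attachments/${tenantId}/${datePrefix}/${notificationId}/${fileName}`;

  await s3
    .putObject({
      Bucket: "my-notification-attachments", // illustrative bucket name
      Key: key,
      Body: body,
    })
    .promise();

  return key;
}
```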
Depending on the volume and predictability of your usage, reserved capacity can be much cheaper than auto scaling or statically provisioned capacity. DynamoDB’s auto scaling feature can handle unpredictable load patterns without human intervention, but it can get expensive. If you have a predictable volume of notifications, maintaining infrastructure sized for that volume will typically cost much less than having to handle unpredictable spikes. It’s even possible to mix auto scaling with reserved capacity (see this example of cost optimization with DynamoDB).
Finally, you need to be able to monitor and analyze your performance metrics and general infrastructure. This is especially important for scalability since the metrics serve as indicators that can pinpoint issues or inefficiencies in your access design patterns. A good monitoring setup can also assist in ensuring security measures and legal compliance. For this, Courier uses Datadog, which can monitor servers, databases, and other tools.
As you’re well aware by now, a scalable setup for a notifications application requires serious planning before building. Since your needs depend on your usage, a scalable system needs a solid foundation that can accommodate higher volume without huge expense or rebuilding. Aim to understand your design patterns and tools and how they can work for your application instead of against it. For example, DynamoDB is not a relational database and should not be used as one. Design meticulously early in the process: getting it right the first time will be invaluable to your company.
At Courier, we use AWS Lambda to run most of our notification-related code. If you’re going to be using AWS Lambda for your notifications application, it’s crucial to tune Lambda configuration to your required usage.
For example, we recommend modifying default timeout values. The default timeout setting can differ significantly between various AWS services and AWS SDK programming languages. In the Node.js SDK, the timeout is 2 minutes, while it’s 60 seconds in the Python SDK and 50 seconds in the Java SDK.
Incorrectly matching timeout settings to your use case can lead to unexpected behavior. If your Lambda function takes longer to run than the timeout configured in the SDK, you might run into unexpected timeouts.
Our typical strategy is to right-size the timeout settings between Lambda functions, the AWS SDK, and other locations in your systems where timeouts can occur. The right timeout values will depend on your needs and the ecosystem you’re working with.
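As a rough sketch of what right-sizing can look like with the AWS SDK for JavaScript (v2), the client below caps connection and response timeouts so a slow downstream call fails fast, well inside a hypothetical 30-second Lambda timeout; the specific numbers are placeholders to adapt, not recommendations.

```typescript
import { DynamoDB } from "aws-sdk";

// Keep SDK-level timeouts and retries well inside the Lambda function's own
// timeout (e.g., 30 seconds) so a slow downstream call fails fast and can be
// retried, instead of the whole invocation timing out.
// The values below are placeholders for illustration.
const dynamodb = new DynamoDB.DocumentClient({
  httpOptions: {
    connectTimeout: 1000, // ms allowed to establish the connection
    timeout: 5000,        // ms allowed to wait for a response
  },
  maxRetries: 2,          // total budget: roughly (connect + response) * attempts
});
```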
In addition, tuning your AWS service configurations and AWS SDK parameters for things like queue visibility timeouts, retry counts, and polling frequency can generate a significant reliability payoff if you line the settings up to complement each other.
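For example, when a Lambda function consumes notifications from an SQS queue, AWS’s documentation suggests setting the queue’s visibility timeout to at least six times the function timeout so in-flight messages aren’t redelivered while an invocation or its retries are still running. A sketch of aligning that attribute is shown below; the function timeout value and queue URL are illustrative.

```typescript
import { SQS } from "aws-sdk";

const sqs = new SQS();

// Align the queue's visibility timeout with the consuming Lambda's timeout so
// messages aren't made visible (and redelivered) while an invocation or its
// retries are still in flight. The 30-second function timeout is illustrative.
const LAMBDA_TIMEOUT_SECONDS = 30;

async function alignVisibilityTimeout(queueUrl: string): Promise<void> {
  await sqs
    .setQueueAttributes({
      QueueUrl: queueUrl,
      Attributes: {
        // At least 6x the function timeout, per AWS guidance for Lambda consumers.
        VisibilityTimeout: String(LAMBDA_TIMEOUT_SECONDS * 6),
      },
    })
    .promise();
}
```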
Building a notification system into a product is not for everyone. The process is time-consuming, complex, and expensive, and your particular requirements will ultimately dictate whether you prioritize functionality or cost. A startup that hasn’t yet found product-market fit has to focus on finding early customers and getting their feedback, while an established company with a proven customer base has to worry about higher volumes, stability, and compliance, which call for more functionality and bring higher maintenance costs.
In this piece, we’ve seen that scaling reliably can be hard, but despite the complexities, it can be done without sacrificing throughput for maximum reliability. Tune in for the next post in this series to learn about routing data and setting up preferences to create the best possible experience for the user getting your notifications. To stay in the loop about the upcoming content, subscribe below or follow us @trycourier!