At your CTO’s request, you recently started researching how to go about revamping or building your product’s notification system. You realized the complexity of this project around the same time as you discovered that there’s not a lot of information online on how to do it. Companies like LinkedIn, Uber, and Slack have large teams working just on notifications, but smaller companies like yours don’t have that luxury. So how can you meet the same level of quality with a team of one? This is the fourth and final post in our series on how you, the developer, can build or improve your company’s notification system. It follows the first post about identifying user requirements, the second about designing with scalability and reliability in mind, and the third about setting up routing and preferences. In this piece, we will learn about using observability and analytics to set your system and company up for success.
Developing an application can often feel like you're building in the dark. Even after development, gathering and organizing performance data is invaluable for ongoing maintenance. This is where observability comes in—it’s the ability to monitor your application’s operation and understand what it’s doing. With close monitoring, observability is a superpower that allows developers to use various data points to foresee potential errors or outages and make informed decisions to prevent these from occurring.
As you build your product, consider the implications of having complete observability built into your notification system. As a developer, you’ll need to identify and quickly resolve issues by understanding how your product is performing. In the bigger picture, observability ties your technological infrastructure to your overarching product and business objectives. These key insights will also help to scale the product and manage data as your business grows.
You’re here because you want to build an application with a powerful notification system that can rival those of existing products. In this guide, you’ll learn why observability and building strong monitoring mechanisms are crucial. Here are four core observability use cases.
Telemetry logs are the backbone of an observability system. The more infrastructure you have, the more data there will be from each instrument and service. You need to be able to understand this data. Logs can provide additional context that allows developers to determine where or why certain issues might be occurring and, in turn, how to fix them. For example, if every API request to your notification system results in a specifically formatted log line, it becomes possible to scan those log lines for anomalies. Event logs of particular actions occurring in the system, like privileged access or settings changes, can sometimes shed light on unexpected behaviors in the system. At the very least, you should have some kind of safety net or global catch to notify you when errors crop up.
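As a minimal sketch of that idea, the snippet below writes one JSON log line per API request and then scans those lines for anomalies. The field names, the `/send` route, and the thresholds are assumptions made for illustration, not a prescribed format.

```python
import json
from datetime import datetime, timezone


def format_request_log(request_id: str, route: str, status: int, duration_ms: float) -> str:
    """Serialize one API request as a single JSON log line."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
    })


def find_anomalies(log_lines: list[str], slow_ms: float = 1000.0) -> list[dict]:
    """Return parsed entries that are server errors or slower than slow_ms."""
    anomalies = []
    for line in log_lines:
        entry = json.loads(line)
        if entry["status"] >= 500 or entry["duration_ms"] > slow_ms:
            anomalies.append(entry)
    return anomalies


# Hypothetical requests: one healthy, one server error, one slow.
lines = [
    format_request_log("req-1", "/send", 202, 48.0),
    format_request_log("req-2", "/send", 500, 51.0),
    format_request_log("req-3", "/send", 202, 2300.0),
]
flagged = find_anomalies(lines)
print([entry["request_id"] for entry in flagged])  # ['req-2', 'req-3']
```

Because every line shares the same shape, the same scan can run in a log pipeline or a scheduled job and feed whatever alerting you already have.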
When users are unable to receive notifications, you can use various logs to determine what factor(s) prevented that notification from going through. If, for example, your app is unable to deliver messages to 20% of your users, logs might reveal that those same 20% of users are using your application on a specific device type. That would point strongly to a bug that prevents your application from functioning properly on that device type, and you can act accordingly. You can also update your system to prevent this issue from occurring for future users.
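One way to surface a pattern like that is to group delivery records by an attribute and compare failure rates. This is only a sketch: the record shape (`user_id`, `device_type`, `status`) and the sample data are hypothetical.

```python
from collections import defaultdict


def failure_rate_by(records: list[dict], key: str) -> dict[str, float]:
    """Compute the share of failed deliveries for each value of `key`."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        if record["status"] == "failed":
            failures[record[key]] += 1
    return {value: failures[value] / totals[value] for value in totals}


# Hypothetical delivery log: every failure happens on one device type.
delivery_log = [
    {"user_id": "u1", "device_type": "ios", "status": "delivered"},
    {"user_id": "u2", "device_type": "ios", "status": "delivered"},
    {"user_id": "u3", "device_type": "android", "status": "failed"},
    {"user_id": "u4", "device_type": "android", "status": "failed"},
    {"user_id": "u5", "device_type": "web", "status": "delivered"},
]
rates = failure_rate_by(delivery_log, "device_type")
print(rates)  # {'ios': 0.0, 'android': 1.0, 'web': 0.0}
```

The same function works for any other attribute you log, such as provider, channel, or app version.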
Let’s say a user has contacted you to report that they have unsubscribed from emails but are still receiving them. Your customer care team should be able to note relevant logs or errors, communicate with the user to ensure a good relationship, and also feed that information to the developer team for potential resolution.
Observability can also help you improve overall user experience, present and future. As you consider important data points to observe within your system, think about what might impact your product, specifically through the lens of your customers. If you can see some metric that might impact user experience, work to improve it before it’s an issue. For example, do your time-sensitive notifications get delivered as quickly as they should? Are all messages being delivered only once? If you’re using multiple channels, are messages being routed correctly?
Additionally, if you’re using a third-party provider like SendGrid or AWS SES, you should absolutely observe connection health. If there’s a provider issue, you could notify your customers, such as through a status page, that notifications might not be working optimally. You might not be able to control the operational status of your providers, but you can still take action to maintain your customers’ trust.
A proper observability environment should provide you with a holistic perspective of your application’s state. Based on usage and performance, you can clearly reason about how certain factors might affect your product. You can gather a real understanding of the rhythm of your notification system. You might see that your users are receiving fewer notifications at specific times in the day or year. How do you know if this is due to your application sending fewer notifications, or because they’re not getting delivered at all? With notifications, data can fluctuate drastically. Peaks in error rates can be cause for concern and require immediate engineering support, while fluctuations in send volume can be observed with caution to ensure that the application is sending the right amount of notifications.
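To tell a drop in send volume apart from a delivery problem, you can compare sent and delivered counts for each time window against a baseline. The sketch below assumes hourly counts and uses illustrative thresholds. Both the function and its cutoffs are hypothetical and would need tuning to your own traffic.

```python
def classify_hour(
    sent: int,
    delivered: int,
    baseline_sent: int,
    min_delivery_rate: float = 0.95,
    low_volume_ratio: float = 0.5,
) -> str:
    """Label one hour of traffic by comparing it with expected behavior."""
    if sent == 0:
        return "no_sends"
    if delivered / sent < min_delivery_rate:
        # Messages were sent but did not arrive: likely needs engineering attention.
        return "delivery_problem"
    if sent < baseline_sent * low_volume_ratio:
        # Delivery is healthy, the application simply sent less than usual.
        return "low_send_volume"
    return "normal"


print(classify_hour(sent=1000, delivered=990, baseline_sent=1000))  # normal
print(classify_hour(sent=300, delivered=298, baseline_sent=1000))   # low_send_volume
print(classify_hour(sent=1000, delivered=600, baseline_sent=1000))  # delivery_problem
```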
As you set up your observability, try to expose the data through user-friendly dashboards and interfaces, such as those that Datadog and Honeycomb offer. All of the collected data should be organized to provide clear insights on the application’s behaviors. Proper data visualizations are invaluable and should be tailored to more than just developer teams. For the customer service team, it is helpful to understand a specific user’s experience when something goes wrong. Likewise, commercial or marketing teams can glean insights for business development.
If your users trust that you can monitor your application efficiently and resolve issues quickly or even use data to prevent their incidence, you’ll build a strong foundation and customer base. How you organize your observability system should ideally connect to your service-level metrics, which should be recorded in your SLAs (service-level agreements) that you have with your users. The observability data will be more relevant to your business if it is connected to your SLIs (service-level indicators) and SLOs (service-level objectives). An observability system that monitors all varieties of resource consumption without this connection to user experience might not foster the type of growth you want.
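As an illustration of tying observability data to service levels, the sketch below computes a delivery-success SLI and the share of the error budget left under an SLO. The 99.5% target and the counts are made-up numbers, and the function names are hypothetical.

```python
def delivery_sli(delivered: int, attempted: int) -> float:
    """SLI: fraction of attempted notifications that were delivered."""
    if attempted == 0:
        return 1.0  # nothing attempted, so nothing failed
    return delivered / attempted


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if the SLO is breached)."""
    allowed_failure = 1.0 - slo
    if allowed_failure <= 0:
        raise ValueError("slo must be below 1.0 to leave an error budget")
    actual_failure = 1.0 - sli
    return (allowed_failure - actual_failure) / allowed_failure


sli = delivery_sli(delivered=9990, attempted=10000)
remaining = error_budget_remaining(sli, slo=0.995)
print(round(sli, 4), round(remaining, 2))  # 0.999 0.8
```

A dashboard built on numbers like these speaks directly to what your users were promised, rather than only to resource consumption.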
Tracking and analyzing data on how your users are interacting with your notifications can help drive business development opportunities. Link tracking, for one, is a core observability component within a notification system. Did the user click on your notification? When and how many times?
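A common way to implement link tracking is to route each link through a redirect endpoint that records the click before forwarding the user. The sketch below shows only the URL handling and click bookkeeping. The `https://example.com/t` endpoint and the `m` and `u` parameter names are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse

TRACKING_BASE = "https://example.com/t"  # hypothetical redirect endpoint


def tracked_link(target_url: str, message_id: str) -> str:
    """Wrap a destination URL so that clicks pass through the tracking endpoint."""
    return f"{TRACKING_BASE}?{urlencode({'m': message_id, 'u': target_url})}"


def record_click(clicks: dict[str, list[int]], tracked_url: str, clicked_at: int) -> str:
    """Store the click time under its message ID and return the redirect target."""
    params = parse_qs(urlparse(tracked_url).query)
    clicks.setdefault(params["m"][0], []).append(clicked_at)
    return params["u"][0]


clicks: dict[str, list[int]] = {}
link = tracked_link("https://example.com/pricing?ref=a&b=1", "msg-1")
destination = record_click(clicks, link, clicked_at=1700000000)
record_click(clicks, link, clicked_at=1700000300)
print(destination)           # https://example.com/pricing?ref=a&b=1
print(len(clicks["msg-1"]))  # 2
```

Storing every click timestamp, rather than a single flag, is what lets you answer both "when" and "how many times".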
Analyzing your observability data, and especially what you do with the resulting insights, is vital to further business growth. Your observability metrics will allow you to determine if the product is meeting business expectations. For instance, you can use observability data to make scaling decisions. If you want to scale, how might increases in volume affect your system? As you aim to separate the signal from the noise in all of your observable data, every signal you find should drive meaningful change. This is where you will find opportunities for further development.
Once you know how to use data to effectively monitor your application and make informed decisions, where do you start? The ultimate goal is to design your observability environment so that it is able to understand data, compare the data between various channels and infrastructure, and make it actionable.
There’s a way to structure data to make it more useful to engineers, customer service, and business development teams. Making sense of this data requires two key measures: correlating and normalizing event data.
Correlation illustrates how different events are connected to each other and how they are connected to different users. In a notification system, this means that every outbound message, the receipt and opening of a message, and every click on the notification are measured in the ways they are statistically related to one another.
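In practice, the simplest form of correlation is linking events that share an identifier. The sketch below groups events by a message ID and orders them in time, so each notification gets its own timeline. The event shape and the `message_id` key are assumptions, and the sketch covers only the linking step, not any statistical analysis on top of it.

```python
from collections import defaultdict

# Hypothetical raw events, arriving interleaved from different parts of the system.
events = [
    {"message_id": "m1", "user_id": "u1", "type": "sent", "ts": 100},
    {"message_id": "m2", "user_id": "u2", "type": "sent", "ts": 101},
    {"message_id": "m1", "user_id": "u1", "type": "clicked", "ts": 165},
    {"message_id": "m1", "user_id": "u1", "type": "delivered", "ts": 103},
    {"message_id": "m1", "user_id": "u1", "type": "opened", "ts": 160},
]


def correlate(raw_events: list[dict]) -> dict[str, list[str]]:
    """Build a time-ordered list of event types for each message."""
    timelines = defaultdict(list)
    for event in sorted(raw_events, key=lambda e: e["ts"]):
        timelines[event["message_id"]].append(event["type"])
    return dict(timelines)


timelines = correlate(events)
print(timelines["m1"])  # ['sent', 'delivered', 'opened', 'clicked']
print(timelines["m2"])  # ['sent']
```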
Normalization refers to how we understand data points from different sources or channels and record them in a way that makes them comparable. That can mean re-scaling values so that they all vary on a similar scale, and it also means recording each provider’s events with the same fields and vocabulary. For example, how would data from email notifications from SendGrid compare to SMS notification data from Twilio? These are not only different companies, but also entirely different channels.
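A sketch of that mapping might look like the following. The shared `NotificationEvent` shape is an assumption, and the provider field names (`sg_message_id`, `event`, `MessageSid`, `MessageStatus`) reflect a simplified reading of the SendGrid event webhook and Twilio status callback payloads, so check them against the providers’ current documentation before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationEvent:
    """Hypothetical common shape for events from any provider."""
    channel: str
    provider: str
    provider_message_id: str
    recipient: str
    status: str
    timestamp: int  # Unix seconds


# Provider-specific statuses mapped onto one shared vocabulary (assumed subset).
SENDGRID_STATUS = {"processed": "sent", "delivered": "delivered", "open": "opened",
                   "click": "clicked", "bounce": "failed", "dropped": "failed"}
TWILIO_STATUS = {"sent": "sent", "delivered": "delivered",
                 "undelivered": "failed", "failed": "failed"}


def from_sendgrid(payload: dict) -> NotificationEvent:
    return NotificationEvent(
        "email", "sendgrid", payload["sg_message_id"], payload["email"],
        SENDGRID_STATUS.get(payload["event"], "unknown"), payload["timestamp"],
    )


def from_twilio(form: dict, received_at: int) -> NotificationEvent:
    # The callback is assumed to carry no timestamp, so the receive time is used.
    return NotificationEvent(
        "sms", "twilio", form["MessageSid"], form["To"],
        TWILIO_STATUS.get(form["MessageStatus"], "unknown"), received_at,
    )


email_event = from_sendgrid({"sg_message_id": "abc.123", "email": "a@example.com",
                             "event": "bounce", "timestamp": 1700000000})
sms_event = from_twilio({"MessageSid": "SM123", "To": "+15550100",
                         "MessageStatus": "undelivered"}, received_at=1700000005)
print(email_event.status, sms_event.status)  # failed failed
```

Once both providers produce the same record type, a bounce and an undelivered SMS can be counted, charted, and alerted on together.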
Correlation and normalization of data allow you to have a more complete understanding of how your product’s notifications are performing, both in general and in relation to one another. You would then be able to filter the data and, for example, find all events related to one user. This would include all email, SMS, direct message, and push notifications regardless of how and where the user received them. You can also filter the data to find all emails sent out on a specific date under certain conditions, like for multiple users or in specific regions.
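With events stored in one common shape, that kind of filtering becomes a few lines of code. The records and field names below (`user_id`, `channel`, `date`, `region`) are hypothetical, and in a real system the same filter would more likely be a database or log query.

```python
def filter_events(events: list[dict], **criteria) -> list[dict]:
    """Return events whose fields match every given key/value pair."""
    return [
        event for event in events
        if all(event.get(field) == value for field, value in criteria.items())
    ]


# Hypothetical normalized events across several channels.
all_events = [
    {"user_id": "u1", "channel": "email", "date": "2024-01-05", "region": "eu"},
    {"user_id": "u1", "channel": "sms", "date": "2024-01-05", "region": "eu"},
    {"user_id": "u2", "channel": "email", "date": "2024-01-05", "region": "us"},
    {"user_id": "u1", "channel": "push", "date": "2024-01-06", "region": "eu"},
]

# Everything one user received, regardless of channel.
print(len(filter_events(all_events, user_id="u1")))  # 3
# All emails sent on a given date in one region.
print(filter_events(all_events, channel="email", date="2024-01-05", region="us"))
```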
Ultimately, if you’re working with several providers for your notification system, correlating and normalizing data points will be a vital step to achieving observability.
Your setup for an observability system will depend on your tech stack. For the sake of this example of a serverless notification system, we’ll use AWS DynamoDB and Lambda. Generally, observability boils down to metrics, logs, and traces—and how you manage them.
Since we’re using AWS DynamoDB and Lambda, AWS CloudWatch is a great tool that provides built-in monitoring in connection with many other AWS services. CloudWatch collects both custom logs and those of other AWS services, as well as infrastructure metrics.
At Courier we import these metrics and logs into Datadog, which aggregates everything on a dashboard. Datadog can be integrated with both DynamoDB and Lambda, as well as hundreds of other services.
In a notification system using Lambda and DynamoDB, you should monitor all default performance metrics such as the number of function invocations, the number of items being modified, and so forth. Other metrics worth tracking for notifications, not yet mentioned here, include latency, error rates, and request rates.
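One option for emitting custom metrics like these from a Lambda function is the CloudWatch embedded metric format, where a structured JSON log line is turned into metrics by CloudWatch. The sketch below builds such a line. The namespace, metric names, and dimensions are assumptions chosen for illustration.

```python
import json
import time


def emf_log_line(channel: str, provider: str, latency_ms: float, error: bool,
                 namespace: str = "NotificationSystem") -> str:
    """Build one embedded-metric-format log line for a single send attempt."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Channel", "Provider"]],
                "Metrics": [
                    {"Name": "SendLatency", "Unit": "Milliseconds"},
                    {"Name": "SendErrors", "Unit": "Count"},
                    {"Name": "SendRequests", "Unit": "Count"},
                ],
            }],
        },
        # Dimension and metric values live at the top level of the document.
        "Channel": channel,
        "Provider": provider,
        "SendLatency": latency_ms,
        "SendErrors": 1 if error else 0,
        "SendRequests": 1,
    })


# Inside a Lambda handler, printing the line sends it to CloudWatch Logs.
line = emf_log_line("email", "sendgrid", latency_ms=182.5, error=False)
print(line)
```

Summing `SendErrors` and `SendRequests` over a period gives an error rate per channel and provider without a separate metrics API call on the send path.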
For DynamoDB in particular, it is valuable to monitor access patterns, such as which keys are read and written most often, in order to avoid hot keys. A tool like CloudWatch Contributor Insights can help identify and analyze these access patterns.
AWS Lambda automatically captures function logs and sends them to AWS CloudWatch. In addition, the Lambda Logs API lets extensions subscribe to function logs, extension logs, and the Lambda platform logs for events and errors. Datadog can import these as well.
For logging DynamoDB activity, AWS offers CloudTrail. All API calls for DynamoDB are captured as events. You can search through the event history, or you can create a trail for ongoing event delivery to an AWS S3 bucket. This allows for an extended record of events, and you can also send these logs to CloudWatch.
If you’re using a tool like Datadog, you might forward your CloudWatch logs and metrics to Datadog using a Forwarder Lambda function. If you’re also using Kinesis in your tech stack to quickly process streaming data, you can use a Kinesis Data Firehose delivery stream to forward logs to Datadog as well.
There are many vital metrics and logs to observe in a notification system. Some, already mentioned in this article, include the sending of notifications, receipt and opening of notifications, and any clicks on notifications. Other important ones include deliverability, open rate, and conversion rate. You’ll also want to note the channel and provider for each notification, such as Twilio for SMS. You are inherently looking for two different things. The first is how your notification system is operating so that you can resolve potential bugs and strive for improvement. The second is how successful your notifications are in engaging with your users. Any log or metric that might help you define those two components will be useful.
Finally, traces can greatly help you understand the context of logs or metrics. Traces track requests from beginning to end through all of the components in your tech stack. Tracing is especially key when your tech stack involves several intertwined systems, as it can help identify bottlenecks between those systems. Traces are normally tied to logs through a shared identifier, such as a unique ID for each request. One idea would be to attribute each operation in DynamoDB and Lambda to a specific notification that is sent by a specific user.
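A lightweight version of this is to generate one ID per request and attach it to every log line the request produces. The sketch below uses an in-memory list in place of a real log sink, and the component names and messages are placeholders.

```python
import json
import uuid


def log_step(sink: list[str], trace_id: str, component: str, message: str) -> None:
    """Append one JSON log line that carries the request's trace ID."""
    sink.append(json.dumps({"trace_id": trace_id, "component": component, "message": message}))


def send_notification(sink: list[str], user_id: str) -> str:
    """Simulate one request passing through several components."""
    trace_id = str(uuid.uuid4())  # one ID shared by every step of this request
    log_step(sink, trace_id, "api", f"accepted request for {user_id}")
    log_step(sink, trace_id, "lambda", "rendered template")
    log_step(sink, trace_id, "dynamodb", "stored delivery record")
    return trace_id


def components_for(sink: list[str], trace_id: str) -> list[str]:
    """Reconstruct the path of one request from the shared log stream."""
    entries = (json.loads(line) for line in sink)
    return [entry["component"] for entry in entries if entry["trace_id"] == trace_id]


log_sink: list[str] = []
first = send_notification(log_sink, "u1")
second = send_notification(log_sink, "u2")
print(components_for(log_sink, first))  # ['api', 'lambda', 'dynamodb']
```

Dedicated tracing tools add timing and parent-child relationships on top of this, but a shared ID in the logs is often enough to follow a single notification end to end.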
If you’re building your own notification system, observability is vital for both maintaining and scaling your product. It is a preventative measure that can improve the performance and health of your system. Whatever the relevant observability metrics may be for you, the data should be measurable, actionable, and meaningful. Remember that it’s what you do with your data that impacts your business the most.
This piece taught us about the necessity of observability and analytics to monitor the functioning and performance of your in-house notifications system, as well as the advantages it provides for its future ability to scale. This is the last post in this series about how to build your own notifications. Soon, we will release an eBook so that you can access all of this information together and use it as a reference as you get building. To stay in the loop about the upcoming content, subscribe below or follow us @trycourier!