As we look forward to 2021, Synthetic Monitoring continues to be as important as ever in understanding the performance of your app or website. But your synthetic monitoring is only as good as the tool you're using and there are a lot of product choices. Since selecting the best one for you is critical, the choice can be overwhelming. Price, setup ease, accuracy, and more play a part in the best solution.
In this article, I reviewed six popular synthetic monitoring choices: Updown.io, UptimeRobot, StatusCake, Site24x7, Pingdom, and DataDog, and applied several measurements against them in a real-world situation to see how each product performed. Then, I set up a typical example deployment on DigitalOcean and ran tests for five days against each product. Finally, I collected the metrics and compared them against one another to determine which tool was the best.
Let’s start with a little background.
Synthetic monitoring is the practice of simulating what an end user might do on your website or app, and then monitoring that website or app for performance. It's a key tool for measuring end user availability. By the time a bug or regression hits your active site, it’s too late to prevent the impact. However, these tools can help you quickly resolve issues and contain the damage.
One of the most common types of synthetic monitoring uses externally hosted tools to periodically open a site or URL, determine whether it loads properly, and measure how long it takes to respond. "Properly" can be defined during the creation of the test. What happens with that data varies from product to product, but typical features include screenshots of failures, latency waterfalls that show where page-load slowdowns originate, and even fully synthetic workflows such as logging in as a user and executing a transaction.
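To make the idea concrete, here is a minimal sketch of a single availability check in Python. It is not how any of the reviewed products work internally. The names `CheckResult`, `evaluate`, `fetch_status`, and `run_check`, as well as the 200 status and 2-second limit used to define "properly", are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    status: Optional[int]  # None when no HTTP response was received
    elapsed_s: float
    ok: bool


def evaluate(status: Optional[int], elapsed_s: float,
             expected_status: int = 200, max_elapsed_s: float = 2.0) -> bool:
    """Hypothetical definition of "properly": expected status, fast enough."""
    return status == expected_status and elapsed_s <= max_elapsed_s


def fetch_status(url: str, timeout_s: float = 10.0) -> Optional[int]:
    """Return the HTTP status code, or None if the request failed outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # DNS failure, refused connection, or timeout


def run_check(url: str,
              fetch: Callable[[str], Optional[int]] = fetch_status,
              clock: Callable[[], float] = time.monotonic) -> CheckResult:
    """Time one request and judge it against the 'properly' definition."""
    start = clock()
    status = fetch(url)
    elapsed = clock() - start
    return CheckResult(url, status, elapsed, evaluate(status, elapsed))
```

A scheduler would call `run_check("https://example.com/")` on an interval from several locations. The `fetch` and `clock` parameters exist only so the logic can be exercised without touching the network.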
While it's possible to roll your own synthetic monitoring through localized testing, there’s no replacement for a solid product that uses last-mile monitoring to report exactly what your customers are seeing—and from an unbiased source. That's why DevOps teams tend to rely on these products to help correlate issues outside their scope of monitoring and validate when a user is seeing impact even when the systems think they're healthy.
Now let's look at the considerations I used to compare the products. The relationship between your product and your monitoring tool will hopefully last a very long time, so it’s important to get it right the first time: a miscalculation in monitoring services could be the difference between customer success and customer distrust. We’ll compare the services using six key requirements that any business and team should consider before committing to what will be a partner in your product’s success:
1. Price
2. Setup Ease
3. Accuracy
4. Day 2 Operability
5. Notification Channel Integrations
6. APM Integrations
Let's look at each of these in a little more detail.
1. Price
As with many business decisions, our number one consideration is price. There's no reason to use a service that begins at a low cost but becomes exorbitantly expensive as you grow. So we’ll be explicitly looking at cost per site on a per-month basis.
2. Setup Ease
The next biggest consideration is how much work is involved in the monitoring, alerting, and management of the services. If it is too difficult, your team may not even use the service that you’re paying for in the first place. Unfortunately, there are many variables in determining “difficulty” as a Key Performance Indicator. Here we'll give a score of 1-5 (5 being the best) based on difficulty in finding the options to create, edit, and delete basic synthetic tests.
3. Accuracy
If alerts are constantly firing, they can breed a sense of distrust and complacency, or “Alarm Fatigue,” that commonly plagues operations and developers alike. If a tree falls in a forest, and everyone hears it every 10 minutes for days at a time, is anyone really monitoring? The opposite is also true. If no alarm fires, was your site ever really monitored at all? There are numerous considerations here, so I will assign a score of 1-5 (5 being the best) based on the following metrics:
4. Day 2 Operability
Implementation and reporting are just the first steps of any observability platform, but the thing your human teammates will care about most is how easy it is to use and interpret the results. While this is a more subjective analysis, it’s no less important. This is a pass/fail score based on whether I could easily and intuitively navigate to an alert and identify the source of the failures.
5. Notification Channel Integrations (Slack, Discord, Email, Pagerduty, and others)
Users should be able to get notifications where they feel most comfortable, and different types of systems may require different types of alert severity. It may not be important to dial an engineer for Jira failures over a weekend, but it’s very important to get all hands on deck ASAP when the entire site is down. We’ll be scoring 1-10 (10 being the best) with special weight for Slack, Discord, email, and Pagerduty integrations. I went with 10 here instead of 5 to more accurately account for the range of integrations possible.
6. APM Integrations
Finally, we’ll consider how difficult it is to triage incidents. Namely, does each of these products offer options to tie back into your core monitoring systems for better correlation and root cause analysis?
To fairly test each monitoring service, I built a typical website edge deployment on DigitalOcean in three different datacenters (San Francisco, New York, and London), fronted by DigitalOcean load balancers and CloudFlare Geo Routed Edge Proxies. Each test runs against this setup to determine site availability, status code, and response time. The service itself is a simple Nginx container running on Debian 10 hosts.
We will inject faults by partially shutting down services for 20 minutes on a single container, and then entirely for 30 minutes across all containers. Afterwards, we will determine how long it takes each service to catch the failure and alert. Agents will be set up with similar configurations.
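The "how long it takes each service to catch the failure" measurement can be sketched as a small calculation. This is only an illustration: the function name `time_to_detect` is hypothetical, and the timestamps below are made up for the example, not measured results from the tests.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def time_to_detect(outage_start: datetime, outage_end: datetime,
                   alerts: Sequence[datetime]) -> Optional[timedelta]:
    """Delay between the start of an outage and the first alert raised
    inside the outage window; None if the outage was never alerted on."""
    in_window = [a for a in alerts if outage_start <= a <= outage_end]
    if not in_window:
        return None
    return min(in_window) - outage_start


# Hypothetical timestamps, for illustration only (not measured results).
partial_start = datetime(2020, 12, 1, 10, 0)
partial_end = partial_start + timedelta(minutes=20)  # one container down
full_start = partial_end
full_end = full_start + timedelta(minutes=30)        # all containers down
alerts = [datetime(2020, 12, 1, 10, 26)]

print(time_to_detect(partial_start, partial_end, alerts))  # None
print(time_to_detect(full_start, full_end, alerts))        # 0:06:00
```

Here the single alert falls inside the full outage, so the partial outage reports no detection and the full outage reports a six-minute delay.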
By shutting off services on various hosts, we can simulate a code deployment or application failure that should be enough to take services out of the load balancers configured in DigitalOcean and CloudFlare.
Based on our design, we should be extremely resilient to such faults and agile in our rollouts.
However, as seasoned admins, we know that there is still a limit to even this level of distribution and resiliency, and it is our fail-safe last-mile monitoring that alerts us to critical customer impact.
Because each product has vastly different timings allowed for trial accounts, we'll level the playing field by setting each service to 15-minute threshold alerts and multiple failures before alerting where applicable. Each product should be able to tell us the moment it detects the site is no longer reachable during our deployments. In a perfect world, our partial deployments (though they may cause some latency or reduced capacity) shouldn't impact any customers, and so should not alert for downtime.
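The "multiple failures before alerting" rule can be sketched as follows. This is a minimal illustration, assuming checks run every 5 minutes so that three consecutive failures span roughly 15 minutes. The function name, the interval, and the count are hypothetical and do not reflect any vendor's actual settings.

```python
from typing import Iterable, Optional


def first_alert_index(results: Iterable[bool],
                      failures_required: int = 3) -> Optional[int]:
    """Return the index of the check that would trigger an alert, i.e. the
    check that completes `failures_required` consecutive failures.
    `results` holds one boolean per check: True = healthy, False = failed.
    Returns None if the threshold is never reached."""
    streak = 0
    for i, healthy in enumerate(results):
        streak = 0 if healthy else streak + 1
        if streak >= failures_required:
            return i
    return None


# A brief blip never reaches three failures in a row, so nothing fires.
blip = [True, False, True, False, False, True]
# A sustained outage fires on the third consecutive failed check.
outage = [True, False, False, False, False]

print(first_alert_index(blip))    # None
print(first_alert_index(outage))  # 3
```

Requiring consecutive failures trades detection speed for fewer false positives, which is the balance the partial-deployment test is meant to probe.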
We're all set up and ready to go, so let's deploy the products and see how each one performs. I'll provide a summary table for each product, including the setup details, metrics/results, and then an overall impression. At the end, I'll summarize my thoughts on which product is best.
A relatively small French company, Updown's goal is to provide an inexpensive and user-friendly tool with a simple, slick interface.
Setup was extremely simple: adding a URL and a handful of settings immediately allowed me to monitor my site with a beautiful dashboard. The score for configuring notifications took a hit due to the out-of-sight, out-of-mind location of the options and the limited number of options available. To configure Pagerduty, I had to either set up a custom webhook or trust the SMS notification options provided by the tool itself.
Metrics and Results
The other notifications that I configured came through with a concise summary of the problem, but unlike the other tools on this list, Updown didn't provide a direct link back to the original alert. Furthermore, after browsing back to the site manually and diving deeper into the alert itself, I couldn't easily zoom or gather more data other than what was provided in the image above.
This tool specializes solely in notifying issues. Beyond that, however, there was a heavy reliance on unaffiliated third-party tools.
That being said, the cost of monitoring with this tool is extremely affordable, and the barrier to entry so low, that it may still be worth considering, if you are willing to put in the work of figuring out the appropriate way to set up two-way escalation methods like Pagerduty.
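As a rough idea of what that custom webhook work involves, the sketch below builds a request for PagerDuty's Events API v2. It is not the setup used in this test. The incoming fields (`site_url`, `is_down`) are hypothetical stand-ins, since the monitoring tool's own webhook payload format is not shown here, and the routing key is a placeholder.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_pagerduty_event(routing_key: str, site_url: str, is_down: bool) -> dict:
    """Translate a (hypothetical) up/down notification into an Events API v2 body."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger" if is_down else "resolve",
        # The same dedup_key lets a later "resolve" close the matching incident.
        "dedup_key": f"synthetic-{site_url}",
        "payload": {
            "summary": f"{site_url} is {'DOWN' if is_down else 'UP'}",
            "source": site_url,
            "severity": "critical" if is_down else "info",
        },
    }


def build_request(event: dict) -> urllib.request.Request:
    """Prepare (but do not send) the HTTP POST carrying the event."""
    return urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


event = build_pagerduty_event("YOUR_ROUTING_KEY", "https://example.com/", is_down=True)
request = build_request(event)  # pass to urllib.request.urlopen() to send
```

Sending a "resolve" event with the same `dedup_key` when the site recovers is what makes the escalation two-way rather than fire-and-forget.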
With simple monitoring, simple setup, and a free-tier option to boot for entry-level monitors, UptimeRobot is a low-barrier option for quickly implementing site monitoring.
Test configuration was extremely simple; however, that's all you can really do in the primary interface of this tool. It was difficult to find the escalation points, as they were nested in the global configurations, but the tool did provide many pre-canned setup options, making my notification setup a breeze.
Metrics and Results
The escalations all fired very quickly without any false positives. The product detected both outages and only alerted on the latter complete outage. Unfortunately, diving into the alert itself was somewhat vague as I couldn't see where the failures occurred. The summary only showed a rough graph of response times and the timestamps. There wasn't zoom functionality, which means that it was only useful as a trigger to start investigating rather than real clues on where to dig in.
The setup is super easy and the price is fantastic. With accurate escalations this covers the very baseline of everything I’d want in a synthetic monitor. Unfortunately, due to the lack of functionality available to understand the nature of the failure, this tool is best used as a trigger to the real work rather than a finder of smoking guns.
StatusCake is an inclusive monitoring solution specializing in web health checks. With a free tier as its cheapest option, it's a great place to start.
StatusCake has clear instructions on how to create new tests. However, it lost some points due to its very busy interface. After the initial color shock, the workflow in the navigation pane was simple to understand, and the alerting integrations offered notification options for everything I needed.
Metrics and Results
While the dashboard itself was concise in its information display and clear in its messaging, I did not receive any alerts for the failed test. It was unclear whether this was due to incorrect configuration or to the short duration of the tests (<30 minutes). But without a trigger to draw my attention to this alert, the tool wasn't very helpful.
The user interface is very busy and the color scheme is distracting. On the other hand, the focus on the tasks and ease of creating workflows makes navigating the options simple and understandable. The fact that it appears so simple to configure makes it somewhat disappointing that there weren't any escalations performed.
Site24x7 offers a suite of software for monitoring everything from Windows servers to Cron jobs. It bills itself as a one-stop-shop for all your monitoring needs with several inclusive monthly plans.
Initially, Site24x7 gave me a very easy-to-follow workflow for adding my first monitor; however, every addition thereafter required some prerequisite knowledge. They do offer some tutorials for first-time users, but the multitude of options can be overwhelming.
Once I found the general flow of Third-Party Integrations and Web Page Speed (Browser) toolsets, I was able to select from a plethora of curated and pre-configured notification options.
Metrics and Results
This tool detected the partial outage from the deployment, which was represented as a large spike in page load time and 500’s from various regions.
This provided a good balance of sensitivity to alerting as I was able to see different behaviors depending on the situation. I didn't see any other issues after gathering the remaining metrics from the week.
Site24x7 is a bit overcrowded in its functionality causing it to appear opinionated in its approach to workflow. However, with its plethora of customizations, it is a powerful and feature-rich tool.
The product lost some points in accuracy. Although both events were marked as "alerting", I didn’t receive any Pagerduty escalations, which caused a delay in troubleshooting. The RCA in the alerts provided a clear screenshot of the failure and a good direction to start investigation. With the extra toolsets available for integrated application monitoring, this could be a powerful tool in your arsenal.
Pingdom is a well-known product in the SolarWinds family that specializes in synthetic monitoring and user workflow testing. It's been around a long time (about 15 years) and is built to be simple and to the point. With a built-in integration into SolarWinds AppOptics for APM and broad webhook functionality that ties into most platforms, Pingdom offers multiple integration options.
The setup was extremely easy. After only a few clicks and a simple web form to fill out, I could see a summary view of my tests and all the websites I wanted to configure. By selecting "Integrations" in the sidebar, I was able to set up the escalation methods and configure Pagerduty, Slack, and Discord through the webhook interface.
Metrics and Results
Pingdom caught one prolonged outage, but missed the partial outage triggered shortly before it. This was likely because some regions were reporting healthy while others weren’t during my simulated deployment.
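One way a tool could end up ignoring a partial outage like this is a regional quorum rule, sketched below. This is purely an illustration of the guess above, not Pingdom's documented behavior. The function name, the region labels, and the 50% threshold are all assumptions.

```python
from typing import Mapping


def should_alert(region_healthy: Mapping[str, bool],
                 min_failing_fraction: float = 0.5) -> bool:
    """Alert only when more than `min_failing_fraction` of probe regions fail.
    `region_healthy` maps a region label to True (healthy) or False (failed)."""
    if not region_healthy:
        return False
    failing = sum(1 for ok in region_healthy.values() if not ok)
    return failing / len(region_healthy) > min_failing_fraction


# Hypothetical probe results: one region failing vs. every region failing.
partial_outage = {"san-francisco": False, "new-york": True, "london": True}
full_outage = {"san-francisco": False, "new-york": False, "london": False}

print(should_alert(partial_outage))  # False
print(should_alert(full_outage))     # True
```

Under a rule like this, one unhealthy region out of three stays below the threshold, while a site-wide failure crosses it immediately.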
I’ve used Pingdom in the past and they’ve only gotten better and more reliable. With a focus on web monitoring, there is no extra clutter, flows are easy to understand, and the overall ease-of-use of the product is high. I rated Pingdom a 4 in accuracy, however, because with my default settings I didn't receive any triggered Pagerduty alerts.
I did receive quick and relevant notifications through Slack and Discord with an easy-to-follow root cause analysis as to what was wrong. Because Pingdom provides a carte blanche webhook integration pattern, the integrations (custom and service-based) are nearly endless.
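To show why a generic webhook reaches so many destinations, here is a small sketch that formats one alert for Slack and Discord incoming webhooks, which accept JSON bodies with a `text` field and a `content` field respectively. The alert fields (`check_name`, `state`, `detail`) are hypothetical. They are not Pingdom's actual webhook payload.

```python
import json


def format_message(check_name: str, state: str, detail: str) -> str:
    """Build one human-readable line shared by every chat channel."""
    return f"[{state}] {check_name}: {detail}"


def webhook_body(channel: str, check_name: str, state: str, detail: str) -> str:
    """Return the JSON body for an incoming webhook on the given chat service."""
    message = format_message(check_name, state, detail)
    if channel == "slack":
        return json.dumps({"text": message})
    if channel == "discord":
        return json.dumps({"content": message})
    raise ValueError(f"unsupported channel: {channel}")


slack_body = webhook_body("slack", "homepage", "DOWN", "timeout from 3 regions")
discord_body = webhook_body("discord", "homepage", "DOWN", "timeout from 3 regions")
```

Each body would then be sent as an HTTP POST to the webhook URL generated by the respective chat service.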
By digging through the alert, I was able to get a clear picture as to what the monitors were alerting on and from where, allowing me a direct picture into where the problem might be and letting me understand the nature and impact of the failures.
There are also many integrations within the SolarWinds product family, promising Root Cause Analysis pipelines to help me get towards the cause more quickly.
DataDog is one of the more well-known products on this list and has been around for about 10 years (at the time of this writing). It's an all-inclusive option with robust infrastructure and many options for configuration and customization. With such a repertoire, their motto "See inside any stack, any app, at any scale, anywhere" definitely fits the bill.
This product was by far the most complex to set up. It is designed with a vast assortment of options and products. Its high price tag comes with a monstrous amount of configurability in terms of its pre-canned integrations, monitoring options, analytics, and dashboarding. The only barrier I encountered to setting up my notification channels was digging through all the options to find the integration I wanted to use.
Metrics and Results
With the high price tag, DataDog provides the type of data integrity and reporting granularity you would expect. DataDog rapidly notified all of my communication channels, with an easy-to-follow link in each channel that gave clear signals on who was impacted, what had failed, and where the impact had occurred.
Test setup required a deep dive into several product offerings before I ended up in the correct context (which was found under "UX Monitoring", not “Monitors” in the offerings pane). It was abundantly clear that the breadth of offerings in DataDog’s catalog could be integrated and finely tuned to drive not only triage, but automated mitigation with features like code deployment kill switches and API triggered workflows.
It’s no wonder that DataDog is such a popular brand in the industry. However, without careful consideration, its cost can quickly overrun budgets.
Each service has pros and cons, but based on our data we can draw some conclusions and recommend three of these products:
Though it falls short in triaging and diagnosing incidents, Updown.io is the most affordable option for basic notifications. This is an excellent option as a supplementary service for simple architectures or non-critical applications.
UptimeRobot has a simple workflow with easy-to-understand escalation hooks for alerting. If you’re looking for a product that allows you to hit the ground running early in an application’s lifecycle, this could be a solid, affordable option. But as you move forward into production scenarios, you might need to partner with additional open-source analytics tools to track down details of your application’s behavior.
Pingdom stands out as the best balance of cost and functionality: it’s easy to create and maintain alerts, and the ubiquitous nature of its webhooks allows for easy escalation integrations. Add in the integrations available with the rest of the SolarWinds APM suite, and it is on par with the offerings from DataDog, but at a lower cost.
It’s worth noting that I was able to set up this infrastructure and testing for free in each company’s respective trial periods. All the sales teams that reached out to me were also very helpful without being overly invasive, a rare and pleasant trait in our industry. From this experiment, one key takeaway I can recommend is to try them all for yourself and see which is the best fit for your unique use case.
Also published at https://dev.to/mbogan/comparing-synthetic-monitoring-products-g3i