The place to start when planning monitoring for a new service or a checklist when revisiting an existing one.
While well-architected systems are fault-tolerant and can continue operating correctly when some of their components fail, persistent failures are highly undesirable and can lead to degraded performance and even system collapse. However, with well-planned monitoring you should be able to:
- Anticipate disruptions
- Quickly identify the source of problems
- Trigger automated recovery processes
- Trigger alarms
Service monitoring is a broad topic, and there are numerous sub-topics to choose from. However, it is a core piece of system design we need before we can launch a service. Thus we need a principled, structured way of reasoning about and evaluating our monitoring strategies.
Every service is different: it can consist of other, smaller services (e.g., microservices) and requires a different set of metrics. Still, an atomic service almost always comprises an application that encodes business logic, a compute infrastructure that runs it, a number of dependencies, and a network infrastructure to exchange data with dependencies and users. These basic building blocks provide an excellent top-level structure for monitoring analysis.
In this article I want to offer a conceptual monitoring framework that can be applied to a variety of service architectures, from a startup-like MEAN stack to enterprise microservices in the cloud. Moreover, it should serve as a good starting point when planning monitoring for a new service or as a checklist when revisiting an existing one.
First, we start with application-level monitoring. It is the most critical pillar of monitoring and the most difficult to get right. It answers the foremost question, “Is my service running?”, whereas all the other monitoring levels only help us pinpoint the cause of a problem.
In practice, “running” translates into many different metrics and dashboards, so it helps to split the complexity further into the following subsets:
- Business Key Performance Indicators (KPI)
- End-user Experience (EUE)
- Service-Level Agreements (SLA)
In practice, however, these subsets are never entirely disjoint.
Business Key Performance Indicators (KPI)
Your business metrics are the best proxy for determining whether your service runs as expected. For example, for an eCommerce website, your KPIs would include something like “cart abandonment rate”, “average order value”, “products per order” and so forth. However, this set of metrics varies the most from one service to another. Thus, you should collaborate early on with your product team to agree on the minimum set of metrics.
End-user Experience (EUE)
- First Paint or First Contentful Paint
- First Meaningful Paint
- Time to Interactive
Moreover, group this data by browser, platform, and region. See “User-centric Performance Metrics” from Google I/O 2017 for details on how to implement it.
Server-side performance monitoring also provides insight into end-user performance. However, we discuss it as a part of service SLA.
One more technique you should employ for both client-side and server-side End-user Experience monitoring is “Synthetic Transaction Monitoring”, which involves running an external agent that executes pre-recorded user scenarios at regular intervals and mimics real user behavior.
Service-Level Agreements (SLA)
The SLA (Service Level Agreement) is a promise or contract from the issuer to the customer and often includes:
Monitor request rates: total, per API, and optionally per client. Also, measure the ratio of failed requests to total requests. For example, if you rely on HTTP(S), monitor 5xx error codes and keep an eye on 4xx errors.
Measure both client-side and server-side latencies per API method. If your clients are in different regions, make sure to group client-side latencies by region. If your end users access your service via a browser, you can obtain client-side latency using the Resource Timing API. Otherwise, you should rely on latencies reported by your “Synthetic Transaction Monitoring” or Canary Tests. You may read more on canaries here.
The Database and Distributed Systems communities define “Consistency” differently. We refer to the latter, in which consistency has two dimensions: Staleness and Ordering. Monitoring consistency in a cost-effective way is hard, but Staleness is the easier place to start. For example, you may have a separate scenario as part of “Synthetic Transaction Monitoring” that creates and removes objects and checks how soon the effect is observable.
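As a rough illustration of the failure ratio described above, the sketch below computes the share of 5xx and 4xx responses in one window of requests. The function name `error_ratios` and the sample window are made up for this example; in a real setup the status codes would come from your access logs or metrics pipeline.

```python
from collections import Counter


def error_ratios(status_codes):
    """Return (server_error_ratio, client_error_ratio) for a batch of HTTP status codes."""
    total = len(status_codes)
    if total == 0:
        return 0.0, 0.0
    # Group codes by their class: 2 for 2xx, 4 for 4xx, 5 for 5xx, and so on.
    classes = Counter(code // 100 for code in status_codes)
    return classes[5] / total, classes[4] / total


# Hypothetical sample: one monitoring window for a single API method.
window = [200, 200, 201, 404, 500, 200, 503, 200, 200, 200]
server_ratio, client_ratio = error_ratios(window)
print(f"5xx ratio: {server_ratio:.0%}, 4xx ratio: {client_ratio:.0%}")
```

Keeping the two ratios separate reflects the advice above: 5xx errors usually indicate a problem in the service itself, while a rise in 4xx errors may point to a broken client or an API change.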
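One possible shape for such a staleness scenario is sketched below: perform a write, then poll until a read observes it, and report the elapsed time. `measure_staleness` and the in-memory `DelayedStore` are illustrative stand-ins, not part of any real client library; in practice the `write` and `is_visible` callables would wrap calls to your own service.

```python
import time


def measure_staleness(write, is_visible, timeout_s=5.0, poll_interval_s=0.05):
    """Perform a write, then poll until a read observes it.

    Returns the number of seconds until the write became visible,
    or None if it was not observed within timeout_s.
    """
    start = time.monotonic()
    write()
    while time.monotonic() - start < timeout_s:
        if is_visible():
            return time.monotonic() - start
        time.sleep(poll_interval_s)
    return None


class DelayedStore:
    """In-memory stand-in for an eventually consistent store (illustration only)."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self._visible_at = {}

    def put(self, key):
        # The key becomes readable only after the configured delay.
        self._visible_at[key] = time.monotonic() + self.delay_s

    def exists(self, key):
        visible_at = self._visible_at.get(key)
        return visible_at is not None and time.monotonic() >= visible_at


store = DelayedStore(delay_s=0.2)
lag = measure_staleness(lambda: store.put("probe"), lambda: store.exists("probe"))
print(f"write became visible after {lag:.2f}s")
```

The measured value is only as precise as the polling interval, so the interval should be small relative to the staleness you want to alert on.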
It does not matter whether you rely on Serverless Computing like AWS Lambda and Google Cloud Functions, or rent Dedicated Servers at Hetzner. At some point, your compute infrastructure will fail, whether due to physical malfunctions, a data center outage, or resource exhaustion. Resource exhaustion, in turn, can be caused by application memory leaks, broken log rotation, fleet capacity misconfiguration, or a DoS attack.
To better differentiate between failures and help identify their cause, we further split compute infrastructure monitoring into three tiers: CPU, memory, and disk usage. Additionally, for each metric, you should monitor aggregated statistics (mean, p99) per host class, fleet, and region when applicable.
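To make the aggregation concrete, here is a minimal sketch that groups samples by host class and reports the mean and p99 for each group. It uses the nearest-rank definition of a percentile, which is one of several common definitions; the function names and sample values are invented for illustration.

```python
import math
from collections import defaultdict
from statistics import mean


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(99 * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples):
    """Aggregate (group, value) pairs into {group: {"mean": ..., "p99": ...}}."""
    grouped = defaultdict(list)
    for group, value in samples:
        grouped[group].append(value)
    return {g: {"mean": mean(v), "p99": p99(v)} for g, v in grouped.items()}


# Hypothetical CPU utilization samples (percent) tagged with a host class.
samples = [("web", v) for v in range(1, 101)] + [("db", v) for v in (10, 20, 30)]
for host_class, stats in summarize(samples).items():
    print(host_class, stats)
```

The same grouping key could just as well be a fleet or a region; reporting p99 alongside the mean helps expose a few overloaded hosts that an average would hide.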
Here are some metrics you should consider when monitoring your compute infrastructure.
- CPU utilization and CPU load
- Workload versus CPU utilization ratio (cost-effectiveness)
- Process and threads count
- System memory used (total and percentage)
- Swap space
- Application heap used (total and percentage)
- Garbage collection count and time spent (when applicable)
- Disk space used (total and percentage per partition /local, /tmp etc)
- Number of open file descriptors
- Inode usage percentage
- Active hosts versus total hosts in host class/fleet
- Total hosts versus available (e.g., AWS EC2 has limits per account)
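A few of the host-level metrics above can be read with nothing but the Python standard library on a Unix-like system, as the sketch below shows. The function names are made up for this example, and a production setup would more likely rely on a monitoring agent; `os.statvfs` and `os.getloadavg` are not available on Windows.

```python
import os
import shutil


def disk_used_percent(path="/"):
    """Percentage of disk space used on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return None
    return 100.0 * usage.used / usage.total


def inode_used_percent(path="/"):
    """Percentage of inodes used, or None if the filesystem does not report them."""
    stats = os.statvfs(path)
    if stats.f_files == 0:
        return None
    return 100.0 * (stats.f_files - stats.f_ffree) / stats.f_files


def load_per_cpu():
    """One-minute load average divided by the CPU count, or None if unavailable."""
    try:
        one_minute, _, _ = os.getloadavg()
    except OSError:
        return None
    return one_minute / (os.cpu_count() or 1)


print("disk used %:", disk_used_percent("/"))
print("inode used %:", inode_used_percent("/"))
print("load per CPU:", load_per_cpu())
```

Dividing the load average by the CPU count makes the value comparable across host classes with different core counts.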
Modern server-side applications largely depend on external services. Think of your payment processing system, Single Sign-on (SSO) authentication, or advertisement APIs. Even old-fashioned monolithic services usually rely on a separate database.
While a specific dependency may have a unique set of domain-specific metrics, start by monitoring the least common denominator:
- Availability (e.g., errors, timeouts)
- Latency (mean, p99)
- Throughput for Reads and Writes (mean, p99)
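One way to collect these baseline numbers is to route every call to a dependency through a thin wrapper that counts failures and records latency. The `DependencyStats` class and the fake `charge` function below are hypothetical and only sketch the idea; a real system would export these values to its metrics backend instead of keeping them in memory.

```python
import time


class DependencyStats:
    """Records call count, failures, and latencies for one dependency."""

    def __init__(self, name):
        self.name = name
        self.calls = 0
        self.failures = 0
        self.latencies_s = []

    def call(self, func, *args, **kwargs):
        """Invoke func, recording its latency and whether it raised."""
        self.calls += 1
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        finally:
            # Latency is recorded for both successful and failed calls.
            self.latencies_s.append(time.monotonic() - start)

    @property
    def availability(self):
        """Fraction of calls that completed without raising."""
        return 1.0 if self.calls == 0 else 1 - self.failures / self.calls


def charge(amount):
    """Fake payment call (illustration only): negative amounts time out."""
    if amount < 0:
        raise TimeoutError("payment provider did not respond")
    return "ok"


payments = DependencyStats("payments")
for amount in (10, 25, -1, 40):
    try:
        payments.call(charge, amount)
    except TimeoutError:
        pass  # The failure is already counted by the wrapper.

print(payments.name, "availability:", payments.availability)
```

The recorded latencies can then be aggregated into a mean and p99 per dependency, in the same way as any other metric.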
The vast majority of services depend on external data stores, either for persistence or for caching. Thus, depending on the kind of data store (e.g., managed NoSQL, self-hosted SQL database), consider the following set of metrics when implementing monitoring:
- Provisioned and used capacity
- Throttling rate
- Input/Output Operations per Second (IOPs)
- CPU Utilization
- Used memory and storage (total and percentage)
- Number of DB connections
- Replication Lag time or size
Additionally, for cloud services, your cloud platform provider itself is a critical dependency. Make sure you are closely watching its health dashboards.
Finally, the last pillar of service monitoring is network monitoring.
From a service monitoring perspective, we are primarily interested in whether we hit a bandwidth limit or the maximum number of open connections. Both bottlenecks come in different flavours and can originate in different parts of your network infrastructure: the host, the load balancer, or the NAT gateway. Thus, make sure you know the limits of your hardware or IaaS provider and, when applicable, consider the following metrics for each networking device:
- Open-File Descriptors in OS
- In/Out Bits Per Sec
- Active Connection Count
- Load Balancer Spillover
- Load Balancer Surge Queues
Additionally, consider integrating on-premises or cloud DDoS detection and mitigation services, such as AWS Shield or Azure DDoS Protection. These services monitor and protect your network at multiple OSI layers against flood attacks, reflective attacks, and resource exhaustion.
We have now discussed the four pillars of successful service monitoring. The framework is only a minimum set of recommendations: it provides a good foundation but is by no means exhaustive. For example, it does not address advanced topics such as real-time security policy monitoring or distributed application debugging and analysis (AWS X-Ray). Your next steps should be to implement real-time dashboards around your metrics and automate alarming based on thresholds and anomaly detection.
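As a starting point for that last step, the sketch below contrasts a static threshold check with a very simple anomaly check based on the z-score of a new value against recent history. Both functions, the limit of three standard deviations, and the sample latencies are assumptions made for illustration; real alerting systems typically use more robust methods.

```python
from statistics import mean, stdev


def breaches_threshold(value, threshold):
    """Static alarm: fire when the value exceeds a fixed limit."""
    return value > threshold


def is_anomaly(history, value, z_limit=3.0):
    """Flag value if it lies more than z_limit standard deviations from the mean of history."""
    if len(history) < 2:
        return False  # Not enough data to estimate the spread.
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit


# Hypothetical recent p99 latencies in milliseconds.
history = [100, 102, 98, 101, 99, 100, 103, 97]
for latest in (104, 120):
    print(latest, breaches_threshold(latest, 150), is_anomaly(history, latest))
```

In this sample, a value of 120 ms stays below the static limit of 150 ms but is still flagged as an anomaly, which is the kind of case where the two approaches complement each other.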