The Fire Behind the Green
Has it ever happened to you: your dashboard is green, yet support tickets keep flowing in from a specific segment of your users? You spend hours digging in and find out that Android users on a particular app_version in India are affected. Well, you are not alone!
Teams often monitor only the top-level metric and assume they have enough observability into the system, until customers start getting affected while the overall dashboard remains flat. This increases the MTTD (Mean Time To Detect), which in turn results in a higher MTTR (Mean Time To Resolve), costing businesses millions. It leads to reactive investigations, stressful debugging and war rooms, which we can all agree isn't a pretty situation. This article is a guide to multidimensional anomaly detection and how to run it for real-time production systems at scale. Multidimensional anomaly detection essentially means monitoring not just the overall metric but multiple dimensional slices of the metric as independent time series.
Terminology and Absolute Basics
Maintaining reliable production systems is crucial. This in turn requires observability on key system metrics. Automated monitoring using anomaly detection is a central piece that ensures the system is investigated in case of any outlier behavior. Since most metrics are emitted over time, this branch of analysis is commonly referred to as Time Series Anomaly Detection.
Time series metrics offer massive diversity in terms of metric behavior. For example, an ad revenue fluctuation might look very different from, say, an e-commerce metric tracking shipment delays. Therefore, the definition of an anomaly or outlier can differ significantly across domains and often requires specific context and domain knowledge to capture the right issues with high precision and high recall.
Multidimensional Anomaly Detection involves monitoring the same metric over different values of a dimension or combination of dimensions. For example, a metric like pageviews may have a dimension country with values like US, India and Germany. The behavior of pageviews | country=US and pageviews | country=IN may be quite different.
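To make this concrete, here is a minimal sketch, assuming a hypothetical pandas dataframe of raw pageview events, of how one metric becomes several independent time series once you slice by country:

import pandas as pd

# Hypothetical raw events: one row per pageview, with a timestamp and a country.
events = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-01", "2024-01-01", "2024-01-01",
        "2024-01-02", "2024-01-02", "2024-01-02",
    ]),
    "country": ["US", "IN", "US", "US", "IN", "DE"],
})

# Overall metric: a single time series of total pageviews per day.
overall = events.groupby("ts").size().rename("pageviews")

# Multidimensional view: one independent time series per country slice.
per_country = events.groupby(["country", "ts"]).size().rename("pageviews")

print(overall)
print(per_country.loc["IN"])  # pageviews | country=IN as its own series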
Problem Space
Visibility on the top-level metrics alone is not enough. In production systems, you can hit issues in specific use cases that affect only specific segments of your user base. The opposite extreme is monitoring the metric on every dimension slice. This is not the solution either, since it is impractical even for small use cases. It is also often unnecessary for ensuring reliability, especially for the users you care about most. In this section, I’ll spend some time defining the problem space based on the underlying business.
Metric Dimensionality
Consider a metric nSuccessResponses that captures the number of successful responses to API requests. In the figure below, the overall metric behaves very differently than its slices. This could be because:
- Some of the slices are much more dominant than others. For example, one of the APIs gets called far more often.
- The behavior of the cumulative metric is different. For example, two or more APIs with similar volume get averaged out in the sum.
In this example, monitoring the slices is essential because the underlying slices have different behavior or simply operate at different scales.
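A quick back-of-the-envelope example (with made-up volumes) of how a dominant slice hides a severe outage in a smaller one:

# Hypothetical volumes: API A dominates API B by roughly 100x.
api_a = 100_000  # nSuccessResponses per minute for API A
api_b = 1_000    # nSuccessResponses per minute for API B
total_before = api_a + api_b

# API B loses 80% of its successful responses.
api_b_after = api_b * 0.2
total_after = api_a + api_b_after

slice_drop = 1 - api_b_after / api_b            # 80% drop in the slice
overall_drop = 1 - total_after / total_before   # under 1% drop in the overall metric

print(f"slice drop: {slice_drop:.0%}, overall drop: {overall_drop:.1%}")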
Combinatorial explosion of Dimensions
If you monitor every combination, the space explodes: |geo| × |device| × |version|. Even modest dimension cardinalities compound quickly into hundreds of thousands of unique time series, making this a scalability nightmare. For example, 20 geos × 5 device types × 10 app versions × 3 tiers already amounts to 3,000 dimension combinations.
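Sketching the same arithmetic in code makes the growth obvious: each new dimension multiplies, rather than adds to, the number of series (cardinalities below are the ones from the example above):

# Cardinalities from the example above.
dims = {"geo": 20, "device": 5, "app_version": 10, "tier": 3}

series_count = 1
for name, cardinality in dims.items():
    series_count *= cardinality
    print(f"after adding {name}: {series_count:,} slices")
# after adding tier: 3,000 slices -- add one more 20-value dimension and it's 60,000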
Outlier Type
Metric behavior heavily determines the type of detection required. For example:
- For flat, bounded metrics like cpu or error_rate, simple thresholds or mean ± kσ work great (a small sketch of this and the baseline approach follows this list).
- For seasonal metrics like num_signups or traffic, percentage/absolute change vs a baseline is often very helpful. Absolute numbers usually make more sense for lower volumes.
- For trending + seasonal metrics like revenue or DAU, forecasting-based detectors (ETS/MSTL style) with prediction bands offer a much clearer perspective.
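Here is a minimal sketch of the first two detector styles, a mean ± kσ threshold and a change-vs-baseline check. This is illustrative only, not how any particular platform implements them:

import numpy as np

def threshold_anomalies(series: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag points outside mean +/- k*sigma -- suits flat, bounded metrics."""
    mu, sigma = series.mean(), series.std()
    return np.abs(series - mu) > k * sigma

def baseline_change_anomalies(series: np.ndarray, baseline: np.ndarray,
                              pct: float = 0.3, min_abs: float = 50.0) -> np.ndarray:
    """Flag points that deviate from a seasonal baseline (e.g. same hour last week)
    by more than pct, with an absolute floor so low-volume noise doesn't page."""
    delta = np.abs(series - baseline)
    return (delta > pct * np.abs(baseline)) & (delta > min_abs)

# Toy usage with synthetic data: inject a spike into a flat error_rate series.
rng = np.random.default_rng(0)
error_rate = rng.normal(0.01, 0.002, size=200)
error_rate[150] = 0.2
print(np.where(threshold_anomalies(error_rate))[0])  # -> [150]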
Note that there is also a performance aspect to detection algorithms when scaling across time series. For example, scaling a threshold detector across many time series may be much easier than scaling a matrix profile detector, given the history and computation involved.
Approach
The above problem can be broken down into key areas that can be tackled individually.
- Observability Pipeline: This is the backbone of the system that allows a reliable way of capturing metrics from production to a suitable DB for analytics purposes
- Anomaly Detection: This is the core consumer of the pipeline where time series can be analyzed using various detectors which may be useful for different use cases
- Scaling Multidimensional Alerting: In order to monitor multiple slices of the time series, we need to ensure performance not just in the anomaly detection itself, but also through the right data modeling and DB tweaks to keep queries performant
- Choosing the dimensions to monitor: This is about understanding the business use case and deciding what to monitor given the combinatorial nature of the problem. Note that even if we solve scale in terms of queries and alerting, responding to those alerts is still often a manual step and thus very expensive.
Let’s start with the data pipeline.
Observability Pipeline
Production metric systems typically use a producer-consumer mechanism or equivalent where applications can emit events to a topic/stream which is consumed downstream in a metrics database. An example architecture is shared below.
Disclosure: I work at Startree Inc on Apache Pinot and Startree ThirdEye, which is why I’m using it here as an example implementation. The concepts directly extend to other stacks too.
The above is an example of a metrics pipeline where the high level components are
- Apache Kafka: Applications typically emit events to the topic/stream, which is then consumed downstream by an OLAP store. Kafka is a production-grade pub/sub system commonly used in the industry for this use case, offering very high scalability and availability. Google Pub/Sub is another example of such a system.
- Apache Pinot: Apache Pinot is an OLAP store that allows efficient ingestion of this high-volume data. It also has neat indexing features that allow fast querying and aggregation of the data. Apache Druid and ClickHouse are possible alternatives with similar features
- Anomaly Detection: This is typically a system that can query Pinot for time series dataframes and run anomaly detection algorithms (a minimal query sketch follows this list).
- Notification: Once anomalies are detected, the system pushes them to notification channels such as Slack or email so that stakeholders are informed of incidents accordingly
- Root Cause Analysis: This is the stage after an incident is detected, where stakeholders can inspect and drill down into the data to find the root cause of the incident.
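As a concrete, if simplified, illustration of that query step, here is a sketch using the pinotdb Python client against the user_activity_events dataset used later in this article. The broker host and the event_time column (in epoch millis) are assumptions:

import pandas as pd
from pinotdb import connect  # assumes the pinotdb client and a reachable Pinot broker

conn = connect(host="pinot-broker", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# Daily active users for one slice, as a time series the detector can consume.
curs.execute("""
    SELECT DATETIMECONVERT(event_time, '1:MILLISECONDS:EPOCH', '1:DAYS:EPOCH', '1:DAYS') AS day,
           DISTINCTCOUNTHLL(user_id) AS dau
    FROM user_activity_events
    WHERE country = 'US' AND device = 'iOS'
    GROUP BY DATETIMECONVERT(event_time, '1:MILLISECONDS:EPOCH', '1:DAYS:EPOCH', '1:DAYS')
    LIMIT 365
""")

df = pd.DataFrame(list(curs), columns=["day", "dau"]).sort_values("day")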
Next, I want to talk about the anomaly detection piece since it is the primary consumer of the data for this use case.
Anomaly Detection
The detection itself is usually another data pipeline within the application. Using Startree ThirdEye as an example, the simplified architecture looks something like this.
There are a few things worth calling out in the above pipeline
- The Anomaly Detector is an interface that allows both internal and external detector implementations to be plugged in easily.
- Data Preprocessing Step ensures that the Anomaly Detector receives a clean and consistent time series to work with. This may involve transforming, cleaning or interpolating the data.
- The Anomaly Post Processing step is useful for identifying a continuing anomaly over a period of time, so that it can be merged with a previous anomaly that was already reported. In other cases, an anomaly may be ignored based on the alert configuration. (A simplified sketch of this flow follows.)
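These are not ThirdEye's actual interfaces, but a minimal sketch of the shape of such a pipeline helps make the three responsibilities concrete:

from dataclasses import dataclass
from typing import Protocol
import pandas as pd

@dataclass
class Anomaly:
    start: pd.Timestamp
    end: pd.Timestamp

class AnomalyDetector(Protocol):
    """Pluggable detector contract: internal or external implementations."""
    def detect(self, series: pd.Series) -> list[Anomaly]: ...

def preprocess(series: pd.Series, freq: str = "D") -> pd.Series:
    # Give the detector a clean, regular series: fill gaps and interpolate.
    return series.asfreq(freq).interpolate(limit_direction="both")

def postprocess(new: list[Anomaly], reported: list[Anomaly],
                merge_gap: pd.Timedelta = pd.Timedelta("1D")) -> list[Anomaly]:
    # Merge an anomaly that simply continues one that was already reported.
    merged = list(reported)
    for anomaly in new:
        if merged and anomaly.start - merged[-1].end <= merge_gap:
            merged[-1] = Anomaly(merged[-1].start, max(merged[-1].end, anomaly.end))
        else:
            merged.append(anomaly)
    return merged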
The alert configuration is determined by the time series and the detector. Here is an example of a global alert on daily active users (DAU) at a daily granularity.
{
"name": "DAU ETS Alert - Global",
"description": "Alert if total daily active users deviate from an ETS forecast (all users, all regions, all devices)",
"template": {
"name": "startree-ets"
},
"templateProperties": {
"dataSource": "pinot-prod",
"dataset": "user_activity_events",
"aggregationColumn": "user_id",
/* in prod, prefer approx counts DISTINCTCOUNTULL/HLL over accurate DISTINCTCOUNT for perf reasons */
"aggregationFunction": "DISTINCTCOUNTULL",
"monitoringGranularity": "PT1D",
/* ETS-specific tuning */
"seasonalityPeriod": "P7D", // weekly seasonality
"lookback": "P30D", // use last 30 days to train
"sensitivity": "1" // tighten/loosen as needed
}
}
In this alert, we are using a trend + seasonality based ETS detector to track the overall DAU.
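As a rough, non-authoritative sketch of the idea behind such an alert (not ThirdEye's implementation), you can fit an exponential smoothing model over the lookback window and flag the latest point if it falls outside a prediction band; here using statsmodels:

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def ets_band_anomaly(dau: pd.Series, lookback: int = 30,
                     seasonal_periods: int = 7, k: float = 3.0) -> bool:
    """Return True if the latest point falls outside an ETS forecast band."""
    history, latest = dau.iloc[-(lookback + 1):-1], dau.iloc[-1]
    model = ExponentialSmoothing(history, trend="add", seasonal="add",
                                 seasonal_periods=seasonal_periods).fit()
    forecast = float(np.asarray(model.forecast(1))[0])
    band = k * float(np.std(np.asarray(history) - np.asarray(model.fittedvalues)))
    return abs(latest - forecast) > band

# Usage: alert if ets_band_anomaly(daily_dau_series) returns True for today's point.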
Multidimensional Alerts
We can monitor slices of the global metric to better understand the performance of a segment of interest. Let’s say we want to monitor daily active users (DAU) growth over time for the following slices:
- country=US, device=iOS
- country=IN, device=Android
- country=DE
This is essentially a WHERE clause on the time series query assuming SQL.
Most modern anomaly detection platforms, including Startree ThirdEye, support multidimensional alerting. Here is a sample configuration for a multidimensional alert modeling the above scenario.
{
"name": "DAU ETS Alert - Key Slices",
"description": "Alert if daily active users deviate from an ETS forecast for key country/device slices",
"template": {
"name": "startree-ets-dx"
},
"templateProperties": {
"dataSource": "pinot-prod",
"dataset": "user_activity_events",
"aggregationColumn": "user_id",
"aggregationFunction": "DISTINCTCOUNTHLL",
"monitoringGranularity": "PT1D",
/* ETS-specific tuning */
"seasonalityPeriod": "P7D",
"lookback": "P30D",
"sensitivity": "1",
/* Dimension exploration */
"queryFilters": "${queryFilters}",
"enumerationItems": [
{
"name": "US-iOS",
"params": {
"queryFilters": " AND country='US' AND device='iOS'"
}
},
{
"name": "IN-Android",
"params": {
"queryFilters": " AND country='IN' AND device='Android'"
}
},
{
"name": "DE-AllDevices",
"params": {
"queryFilters": " AND country='DE'"
}
}
]
}
}
The underlying execution engine follows a Directed Acyclic Graph (DAG) architecture. The core pipeline is still the simple alert pipeline (DataFetcher → Anomaly Detector), now wrapped in a fork-join driven loop over the enumeration items.
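Conceptually, and very much simplified, with stand-in fetch/detect functions rather than ThirdEye's actual DAG nodes, the fork-join loop looks like this:

from concurrent.futures import ThreadPoolExecutor
import random

# Enumeration items mirroring the alert config above.
enumeration_items = [
    {"name": "US-iOS",        "queryFilters": " AND country='US' AND device='iOS'"},
    {"name": "IN-Android",    "queryFilters": " AND country='IN' AND device='Android'"},
    {"name": "DE-AllDevices", "queryFilters": " AND country='DE'"},
]

BASE_QUERY = "SELECT day, dau FROM user_activity_events WHERE 1=1"

def fetch_timeseries(sql: str) -> list[float]:
    # Stand-in for the DataFetcher node (a real pipeline would query Pinot here).
    return [random.gauss(1000, 50) for _ in range(30)]

def detect(series: list[float]) -> bool:
    # Stand-in for the Anomaly Detector node: naive mean +/- 3 sigma on the last point.
    mu = sum(series) / len(series)
    sigma = (sum((x - mu) ** 2 for x in series) / len(series)) ** 0.5
    return abs(series[-1] - mu) > 3 * sigma

def run_slice(item: dict) -> tuple[str, bool]:
    # One forked branch: DataFetcher -> Anomaly Detector for a single slice.
    series = fetch_timeseries(BASE_QUERY + item["queryFilters"])
    return item["name"], detect(series)

# Fork one branch per enumeration item, then join the results.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(run_slice, enumeration_items))
print(results)  # e.g. {'US-iOS': False, 'IN-Android': False, 'DE-AllDevices': False}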
Managing Multidimensional Alerts in Prod
Multidimensional alerting can quickly scale up and become challenging to operate. Here are some common issues and how to work around them.
- Which cohorts to monitor? Start small: 5–20 slices covering critical cohorts, following the classic 80/20 rule. Once a basic system is working well, you can look at different filtering patterns. One simple approach is to always monitor primary dimensions, like country or OS, which usually have a dedicated business component associated with them. Another is to approach it mathematically, using a business objective function and optimizing the dimension tree under constraints like tree depth. We found these to be extremely useful features in ThirdEye, but generating the slice list externally and feeding it in should not be a big issue.
- Min volume filters: Ignore slices below some threshold (e.g., <50 events per window). To begin with, focus on the cohorts that matter to the business.
- Dedup / aggregation: Don’t page separately for every tiny child slice when a parent slice is clearly broken. (A small sketch of the min-volume and dedup guardrails follows this list.)
- Expectations from the data stack: time-bucketed group-bys, filtered queries, basic throughput. To keep these queries fast at scale, you’ll want time bucketing, pre-aggregations, and indexing in your OLAP store. There are various strategies to make this work; time series performance optimization may get its own article.
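As an illustration of the min-volume and parent/child guardrails (with an assumed in-memory representation of slice alerts, not any platform's API), the filtering logic can be as simple as:

from dataclasses import dataclass

@dataclass
class SliceAlert:
    filters: dict      # e.g. {"country": "IN", "device": "Android"}
    volume: int        # events observed in the evaluated window
    is_anomalous: bool

MIN_VOLUME = 50

def is_parent(parent: SliceAlert, child: SliceAlert) -> bool:
    # A parent's filters are a strict subset of the child's filters.
    return parent is not child and parent.filters.items() <= child.filters.items()

def alerts_to_page(candidates: list) -> list:
    # 1) Drop slices too small to matter; 2) drop children whose parent already fired.
    anomalous = [s for s in candidates if s.is_anomalous and s.volume >= MIN_VOLUME]
    return [s for s in anomalous if not any(is_parent(p, s) for p in anomalous)]

page = alerts_to_page([
    SliceAlert({"country": "IN"}, 120_000, True),
    SliceAlert({"country": "IN", "device": "Android"}, 80_000, True),  # child of IN
    SliceAlert({"country": "DE"}, 30, True),                           # below min volume
])
print([a.filters for a in page])  # -> [{'country': 'IN'}]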
Wrap up
We all love to show green dashboards to our executives. However, the main objective must always be the quality of the actual product or service that we ship. Observability, and especially granular observability, is key to making sure we put out small fires before they spread into widespread ones.
In this article, we walked through:
- Why overall metrics are not enough and how dimensionality can create blind spots.
- How different metric behaviors need different detectors, ranging from simple thresholds for flat signals to ETS/MSTL detectors for trending, seasonal KPIs.
- What a practical setup looks like: a real observability pipeline (Kafka → Pinot → ThirdEye), a global ETS alert on DAU, and then a multidimensional ETS alert that monitors a few high-value slices with a single configuration.
- How to avoid operational pain by starting with a small, curated slice set, enforcing min-volume filters, deduping noisy child slices, and having realistic expectations from your data stack.
If there’s one takeaway, it’s this:
Don’t wait for your users to tell you that your system is broken. Be proactive and granular with monitoring to make sure you have eyes and ears at the right places.
From here, there are two natural next steps:
- Go deeper on performance. How do you keep per-slice queries fast when you have high cardinality and tight SLAs? That’s where time bucketing, pre-aggregations, and indexing strategies in your OLAP store come in.
- Get smarter about which slices to watch. Instead of hand-picking specific cohorts, you can use objective functions and dimension trees to prioritize slices under compute and alert budgets.
Please give it a thumbs up / let me know if the above topics would be helpful, and I can cover them in subsequent follow-ups.
Disclosure: I work at Startree Inc on Apache Pinot and Startree ThirdEye.
