Limitations of Traditional Monitoring

Managing modern distributed applications has become increasingly complex. Traditional monitoring tools, which rely mainly on manual analysis, are insufficient for ensuring the availability and performance demanded by microservice or serverless topologies.

One of the main problems with traditional monitoring is the high volume and variety of telemetry data generated by IT environments. This includes metrics, logs, and traces, which ideally should be consolidated on a single monitoring dashboard so that the entire system can be observed. Another problem is static alarm thresholds. Setting them too low generates a high volume of false positives, while setting them too high fails to detect significant performance degradation.

To solve these problems, organizations are shifting to an intelligent, automated, and predictive approach known as AIOps. Instead of relying on human operators to manually connect the dots, AIOps platforms are designed to ingest and analyze these vast datasets in real time. In this article, we will see how AIOps platforms perform proactive anomaly detection, their most fundamental capability, as well as root cause analysis, prediction, and alert generation.

The Technology Stack

The solution detailed in this article combines three complementary pillars:

- A managed AIOps platform that provides analytical intelligence. We will use AWS DevOps Guru, which is the core of our solution and acts as its "AIOps brain." DevOps Guru is a managed service that leverages machine learning models built and trained by AWS experts. A key design principle is to make AIOps accessible to engineers without machine learning expertise. Its primary function is to detect operational issues or anomalies and produce high-level insights instead of a stream of raw, uncorrelated alerts. These insights include related log snippets, a detailed analysis with a possible root cause, and actionable steps to diagnose and remediate the issue.

- An open-standard observability framework, OpenTelemetry, that supplies high-quality telemetry data and provides a unified set of APIs, SDKs, and tools to generate, collect, and export it. The importance of OpenTelemetry lies in two principles: standardization and vendor neutrality. The practical benefit is that if we want to switch to a different AIOps tool, we can simply redirect the telemetry stream.

- A serverless application that serves as an example of a modern and dynamic microservice topology.

The complete architecture of the proposed telemetry pipeline is shown in the diagram below.
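To make the static-threshold problem from the introduction concrete, here is a minimal Python sketch. It is purely illustrative: the latency values, the two threshold settings, and the z-score rule are assumptions chosen for the example, and the z-score check is only a simple stand-in for baseline learning, not the model DevOps Guru actually uses.

```python
from statistics import mean, stdev

# Hypothetical latency samples in milliseconds (illustrative values only).
baseline = [100, 104, 98, 110, 95, 102, 99, 107, 101, 96]       # normal traffic with jitter
degraded = [330, 410, 290, 380, 450, 310, 360, 420, 300, 390]   # subtle slowdown


def static_alerts(samples, threshold_ms):
    """Count samples that a fixed-threshold alarm would flag."""
    return sum(1 for s in samples if s > threshold_ms)


def baseline_alerts(samples, history, z_limit=3.0):
    """Count samples more than z_limit standard deviations from the learned baseline."""
    mu, sigma = mean(history), stdev(history)
    return sum(1 for s in samples if abs(s - mu) / sigma > z_limit)


# Threshold too low: normal jitter triggers false positives.
low_false_positives = static_alerts(baseline, threshold_ms=100)
# Threshold too high: the slowdown goes unnoticed.
high_missed = static_alerts(degraded, threshold_ms=1000)
# Baseline-relative check: quiet on normal traffic, fires on the slowdown.
adaptive_normal = baseline_alerts(baseline, baseline)
adaptive_degraded = baseline_alerts(degraded, baseline)

print(low_false_positives, high_missed, adaptive_normal, adaptive_degraded)
```

In this toy data set, the low threshold flags half of the healthy requests, the high threshold flags none of the degraded ones, and the baseline-relative check separates the two cleanly.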
Practical Implementation

It is important to understand that DevOps Guru does not collect any telemetry data itself. Instead, it is configured to monitor and continuously analyze the resources created for the application, which it identifies by specific tags. This section provides a guide to implementing the proposed solution; the Experiment section that follows shows how to exercise it.

The following git repository structure aligns with IaC best practices:

    .
    ├── demo
    │   ├── envs
    │   │   └── dev
    │   │       ├── env.hcl              # Environment-specific configuration that sets the environment name
    │   │       ├── api_gateway
    │   │       │   └── terragrunt.hcl
    │   │       ├── devopsguru
    │   │       │   └── terragrunt.hcl
    │   │       ├── dynamodb
    │   │       │   └── terragrunt.hcl
    │   │       ├── iam
    │   │       │   └── terragrunt.hcl
    │   │       └── serverless_app
    │   │           └── terragrunt.hcl
    │   └── project.hcl                  # Project-level configuration defining `app_name_prefix` and `project_name` used across all environments
    ├── root.hcl                         # Root Terragrunt configuration that generates AWS provider blocks and configures S3 backend
    ├── src
    │   ├── app.py                       # Lambda handler function with OpenTelemetry instrumentation
    │   ├── requirements.txt
    │   └── collector.yaml
    └── terraform
        └── modules                      # Infrastructure Modules
            ├── api_gateway
            ├── devopsguru
            ├── dynamodb
            └── iam
This modular Terragrunt approach has the following benefits:

- True environment isolation: each environment (dev, prod, etc.) has its own state, config, and outputs.
- All major AWS resources (Lambda, API Gateway, DynamoDB, IAM, DevOps Guru) are reusable Terraform modules in terraform/modules/.
- Easy to extend to new AWS services or environments with minimal duplication.
The full repository can be found here: https://github.com/kirPoNik/aws-aiops-detection-with-guru

The Lambda function (code in app.py) receives requests from API Gateway, generates a unique ID, and puts an item into the DynamoDB table. It also contains the logic to inject a "gray failure," which is required for our experiment. The key logic is shown in the snippet below:

app.py

    import os
    import time
    import random
    import boto3
    import uuid

    # --- CONFIGURATION FOR GRAY FAILURE SIMULATION ---
    # This environment variable acts as our feature flag for the experiment
    INJECT_LATENCY = os.environ.get("INJECT_LATENCY", "false").lower() == "true"
    MIN_LATENCY_MS = 150  # Minimum artificial latency in milliseconds
    MAX_LATENCY_MS = 500  # Maximum artificial latency in milliseconds

    # Table handle (not shown in the original excerpt; assumes the TABLE_NAME
    # environment variable that terragrunt.hcl sets for the function)
    table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])


    def handler(event, context):
        """
        Handles requests and optionally injects a variable sleep
        to simulate performance degradation.
        """
        # This is the core logic for our "gray failure" simulation
        if INJECT_LATENCY:
            latency_seconds = random.randint(MIN_LATENCY_MS, MAX_LATENCY_MS) / 1000.0
            time.sleep(latency_seconds)

        # The function's primary business logic is to write an item to DynamoDB
        try:
            table.put_item(
                Item={
                    "id": str(uuid.uuid4()),
                    "created_at": int(time.time()),
                }
            )
            # ... returns a successful response ...
        except Exception as e:
            # ... returns an error response ...
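One way to reason about how much delay the feature flag adds is to isolate the random draw in a small helper that can be exercised without sleeping. The sketch below is an assumption for illustration, not code from the repository: the helper name `pick_latency_seconds` is hypothetical, and it reuses only the `MIN_LATENCY_MS` and `MAX_LATENCY_MS` bounds from app.py.

```python
import random

MIN_LATENCY_MS = 150  # same bounds as in app.py
MAX_LATENCY_MS = 500


def pick_latency_seconds(rng):
    """Hypothetical helper: draw the artificial delay without actually sleeping."""
    return rng.randint(MIN_LATENCY_MS, MAX_LATENCY_MS) / 1000.0


# random.randint is uniform over the inclusive range, so the expected
# added delay is the midpoint of the two bounds.
expected_ms = (MIN_LATENCY_MS + MAX_LATENCY_MS) / 2

rng = random.Random(42)  # seeded so the simulation is repeatable
samples = [pick_latency_seconds(rng) for _ in range(10_000)]
simulated_ms = 1000 * sum(samples) / len(samples)

print(f"expected added latency: {expected_ms:.0f} ms")
print(f"simulated mean:         {simulated_ms:.0f} ms")
```

With these bounds, each request gains roughly a third of a second on average: a clear shift from the baseline, yet one that could stay below a generously set static alarm.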
The collector configuration (in collector.yaml) defines pipelines that send traces to AWS X-Ray and metrics to Amazon CloudWatch. The key logic is shown below:

collector.yaml

    # This file configures the OTel Collector in the ADOT layer
    exporters:
      # Send trace data to AWS X-Ray
      awsxray:
      # Send metrics to CloudWatch using the Embedded Metric Format (EMF)
      awsemf:

    service:
      pipelines:
        # The pipeline for traces: receive data -> export to X-Ray
        traces:
          receivers: [otlp]
          exporters: [awsxray]
        # The pipeline for metrics: receive data -> export to CloudWatch
        metrics:
          receivers: [otlp]
          exporters: [awsemf]
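A quick sanity check for a collector configuration like this is that every exporter referenced by a pipeline is also declared under `exporters`. The following Python sketch illustrates the idea on a dictionary that mirrors the excerpt; the function name `undeclared_exporters` is hypothetical, and a real check would first load collector.yaml with a YAML parser rather than use a hand-written dictionary.

```python
# Dictionary mirroring the collector.yaml excerpt (exporters and pipelines only).
config = {
    "exporters": {"awsxray": None, "awsemf": None},
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "exporters": ["awsxray"]},
            "metrics": {"receivers": ["otlp"], "exporters": ["awsemf"]},
        }
    },
}


def undeclared_exporters(cfg):
    """Return {pipeline_name: [exporters used but not declared]} for problem pipelines."""
    declared = set(cfg.get("exporters", {}))
    problems = {}
    for name, pipeline in cfg["service"]["pipelines"].items():
        missing = [e for e in pipeline.get("exporters", []) if e not in declared]
        if missing:
            problems[name] = missing
    return problems


# An empty dict means every exporter used by a pipeline is declared.
print(undeclared_exporters(config))
```

The same pattern extends to receivers once the full file, including its `receivers` section, is loaded.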
Simulating Failure and Generating Insights

The Experiment

Step 1: Deploy the Stack

In the demo/envs/dev directory, run the usual commands:

    terragrunt init --all
    terragrunt plan --all
    terragrunt apply --all

Grab the API endpoint from the output and save it:

    export API_URL=$(terragrunt output -json --all \
      | jq -r 'to_entries[] | select(.key | test("api_endpoint")) | .value.value')

You also need to enable AWS DevOps Guru and wait 15-90 minutes for the "Discovering applications and resources" stage to finish.

Step 2: Establish a Baseline

DevOps Guru needs to learn what "normal" looks like. Let's give it some healthy traffic. We'll use hey, a simple load testing tool perfect for this job.

Why hey? We could use a more complex tool like k6, which is great for scripting detailed user journeys. But for this test, we just need to hit an endpoint with a steady stream of requests. hey does that with a single command, keeping things simple.

Run a light load for a few hours. This gives the ML models plenty of data to build a solid baseline.
    # Run for 4 hours, rate-limited to 5 requests per second per worker
    # (-q is a per-worker limit; add -c 1 for a strict 5 requests per second overall)
    hey -z 4h -q 5 -m POST "$API_URL"

Tip: use GNU Screen to keep this running in the background.

Step 3: Inject the Failure

Now for the fun part. We'll introduce our "gray failure": a subtle slowdown that a simple threshold alarm would likely miss.

In demo/envs/dev/serverless_app/terragrunt.hcl, set the INJECT_LATENCY environment variable of our Lambda function to "true":

    environment_variables = {
      TABLE_NAME                         = dependency.dynamodb.outputs.table_name
      AWS_LAMBDA_EXEC_WRAPPER            = "/opt/otel-instrument"
      OPENTELEMETRY_COLLECTOR_CONFIG_URI = "/var/task/collector.yaml"
      INJECT_LATENCY                     = "true" # <-- Change this to true
    }

Apply the change. This quick deployment is an important event that DevOps Guru will notice.

    terragrunt apply --all

Step 4: Generate Bad Traffic

Run the same load test again. This time, every request will have that extra, variable delay.

    # Run for at least an hour to generate enough bad data
    hey -z 1h -q 5 -m POST "$API_URL"

Our app is now performing worse than its baseline. Let's see if DevOps Guru noticed.

After 30-60 minutes of bad traffic, an "insight" popped up in the DevOps Guru console. This is the real value of AIOps. A standard CloudWatch alarm would have just said, "Latency is high." DevOps Guru said, "Latency is high, and it started right after you deployed this change."

Conclusion

This experiment shows a clear path away from reactive firefighting.
By pairing a standard observability framework like OpenTelemetry with an AIOps engine like AWS DevOps Guru, we can build systems that help us find and fix problems before they become disasters.

The big takeaway is correlation. The magic wasn't just spotting the latency spike; it was automatically linking it to the deployment. That's the jump from raw data to real insight.

The future of ops isn't about more dashboards. It's about fewer, smarter alerts that tell you what's wrong, why it's wrong, and how to fix it.

Resources

- GitHub repository: https://github.com/kirPoNik/aws-aiops-detection-with-guru
- AWS DevOps Guru Official Page
- OpenTelemetry Official Documentation
- AWS Distro for OpenTelemetry (ADOT) for Lambda
- hey - HTTP Load Generator