A distributed architecture introduces several challenges when it comes to operability and monitoring. Here, one may be dealing with tens, if not hundreds, of microservices, which may or may not all have been built by the same team.
With a large-scale system it is very important to have eyes on key system metrics and application health, and to have enough data to track down issues quickly and mitigate them. Being in a cloud environment such as AWS, Google Cloud, or Azure further accentuates the problem: the dynamic nature of the infrastructure (scaling up and down, ephemeral machines, dynamic IPs, etc.) makes it difficult to detect, troubleshoot, and isolate issues.
KEY PILLARS OF OBSERVABILITY:
With a distributed architecture it has become standard practice to use a centralized logging system to gather logs and visualize them. Logs generated by the different microservices are automatically shipped to a central location for storage and analysis. Some of the popular centralized logging solutions for application logs are the ELK stack (Elasticsearch, Logstash, Kibana), Splunk, Sumo Logic, and Graylog.
Logs provide useful information about events that have occurred within your application. These could be INFO level logs or error logs with detailed stack traces around exceptions.
A centralized logging system in a large distributed application may be ingesting gigabytes of data per hour, if not more. Given that a request may travel through multiple microservices, one way to query all logs related to a request spanning several microservices is to attribute each request with some kind of unique identifier.
In most cases this could be a userId associated with the request, or a unique UUID generated at the first point of ingress into the application. The identifier is attached to each and every log message and is propagated downstream from one microservice to another via request headers (in case the identifier is not already part of the request being made downstream). One can then use the requestId or userId to query the logging system and find all logs associated with the request across multiple services!
Fig 1: Centralized Logging
Below are a few examples of how to tag your logs with the required information in Java using request filters.
Fig 2: Log4j2 Config & Sample Log
Fig 3: Request Filters with UUID or UserId
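The request-filter idea from Fig 3 can be sketched in plain Java. This is an illustrative sketch, not the code from the figures: the `RequestId` class and the `X-Request-Id` header name are made up for the example. The rule is simple: reuse the identifier when the caller already supplied one, otherwise mint a UUID at the point of ingress.

```java
import java.util.Map;
import java.util.UUID;

// Sketch of the request-filter logic: reuse an incoming request-id header if
// present, otherwise generate a UUID at the first point of ingress. In a real
// service the resolved id would be put into SLF4J's MDC (MDC.put("requestId", id))
// so that a Log4j2 pattern containing %X{requestId} stamps it on every log line.
class RequestId {
    static final String HEADER = "X-Request-Id";

    // headers: incoming request headers (header name -> value)
    static String resolve(Map<String, String> headers) {
        String id = headers.get(HEADER);
        return (id == null || id.isEmpty()) ? UUID.randomUUID().toString() : id;
    }
}
```

In a servlet filter the id would also be removed from the MDC in a finally block, so it does not leak across pooled threads serving later requests.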
Tracing is a method that allows one to profile and monitor applications while they are running. Traces provide useful information such as:
- the end-to-end latency of a request;
- the time spent in each service, and in each operation within a service;
- the path a request takes through the system, and where errors occur along it.
Below is a sample trace for a request interacting with two microservices (ad-bidding-service and ad-integrator-service).
Fig 4: Tracing
In the above example the data was captured and visualized with DataDog. There are several other ways to capture traces, which I will highlight in the upcoming sections.
A trace is composed of a tree-like structure with a parent trace and child spans. A request trace has scope over several services and is further broken down into smaller chunks, by operation or functionality, known as spans. For example, a call from one microservice to another constitutes a span. Within a microservice there may be multiple spans, depending on how many layers of classes/functions or dependent microservices are called to serve the request.
Tracing works by generating a unique id per request at the point of ingress and propagating that to downstream systems as trace-context in request headers. This allows the different trace information emitted from several services to be tied together at a central location for analysis and visualization.
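That propagation step can be sketched as follows. This is a simplified illustration, not any particular tracing library's API; the `TraceContext` class and the `X-Trace-Id` / `X-Parent-Span-Id` header names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Simplified trace context: the traceId stays constant for the life of the
// request, while each service hop gets a fresh spanId.
class TraceContext {
    final String traceId;
    final String spanId;

    TraceContext(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }

    // Called when a request arrives: join the existing trace if the caller sent
    // a trace header, otherwise start a new trace at the point of ingress.
    static TraceContext fromHeaders(Map<String, String> headers) {
        String traceId = headers.getOrDefault("X-Trace-Id", UUID.randomUUID().toString());
        return new TraceContext(traceId, UUID.randomUUID().toString());
    }

    // Headers to attach to outgoing calls so downstream services join the trace.
    Map<String, String> toHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-Trace-Id", traceId);
        headers.put("X-Parent-Span-Id", spanId);
        return headers;
    }
}
```

Each service also reports its spans, carrying the shared traceId, to the tracing backend, which stitches them together into the tree shown in Fig 4.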
Now that we are able to search logs by userId or some other unique identifier (such as a generated UUID), and are able to trace the performance and behavior of a single request, would it not be great if we could tie the two together and correlate logs and traces for a single request?
Having this correlation between logs and traces allows one to:
- jump from a trace that pinpoints a slow or failing span directly to the logs emitted while serving that request, and vice versa;
- filter gigabytes of centralized logs down to the handful of lines belonging to a single request;
- quickly isolate which service in the request path is responsible for an issue.
APPROACH #1: INSTRUMENTING WITH 3RD PARTY TOOLS SUCH AS DATADOG
Reference: DataDog APM
In this approach we instrument the services in the distributed system with the DataDog APM (application performance monitoring) agent. DataDog traces 100% of requests and is also able to collect the logs emitted by your applications.
DataDog essentially takes care of all your centralized logging and trace-information collection. It generates unique traceIds and propagates them across all instrumented downstream microservices automatically. The only thing we need to do is attach the DataDog traceId to our logs, and we achieve log-trace correlation.
Fig 6: Application Instrumentation Using DataDog
Fig 7: Log-Trace Correlation as seen in DataDog
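With the DataDog Java agent, enabling log injection (the `-Ddd.logs.injection=true` JVM flag) puts the trace and span ids into the MDC, so attaching them to logs is a one-line pattern change. A sketch of the relevant Log4j2 snippet, assuming an otherwise standard Log4j2 setup:

```xml
<!-- %X{dd.trace_id} / %X{dd.span_id} are populated by the DataDog agent's
     log injection; DataDog then links each log line to its trace. -->
<PatternLayout
    pattern="%d{ISO8601} %-5level [%t] %X{dd.trace_id} %X{dd.span_id} %logger - %msg%n"/>
```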
APPROACH #2: ZIPKIN AND SPRING CLOUD SLEUTH WITH A SPRING BOOT APPLICATION
Reference: Zipkin, Spring Cloud Sleuth
Advantages:
- Open source, with no licensing cost.
- Spring Cloud Sleuth auto-instruments Spring Boot applications: traceIds and spanIds are generated, stamped onto log lines via the MDC, and propagated downstream with almost no code changes.
Disadvantages:
- The Zipkin backend must be deployed, scaled, and maintained by your own team.
- The out-of-the-box experience is tailored to the Spring/JVM ecosystem; services in other languages need separate instrumentation.
Design:
Fig 8: Zipkin, Spring Cloud Sleuth and Spring Boot Application
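For a Maven-based Spring Boot service, the setup boils down to roughly the following two dependencies. The artifact names below are from the Spring Cloud Sleuth 3.x line and vary slightly by Spring Cloud release train, so treat this as a sketch:

```xml
<!-- Sleuth instruments the application and adds traceId/spanId to the MDC;
     the Zipkin reporter ships the recorded spans to a Zipkin server. -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-zipkin</artifactId>
</dependency>
```

With these on the classpath, Sleuth prefixes each log line with `[serviceName,traceId,spanId]` and propagates the ids downstream via B3 headers; the `spring.zipkin.base-url` property points the reporter at the Zipkin server.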
APPROACH #3: AMAZON X-RAY
Reference: Amazon X-Ray
Advantages:
- Fully managed AWS service; there is no tracing backend to install, scale, or operate.
- Integrates natively with other AWS services such as API Gateway, Lambda, ECS, and Elastic Beanstalk.
Disadvantages:
- Ties the solution to AWS, which is a poor fit for multi-cloud or on-premise deployments.
- Requests are sampled by default, so not every request is traced.
APPROACH #4: JAEGER
Reference: Jaeger
Advantages:
- Open source (a CNCF project) and compatible with the OpenTracing API.
- Client libraries are available for many languages, which suits polyglot systems.
Disadvantages:
- More complicated to set up and maintain from an infrastructure perspective, since the Jaeger backend is self-hosted.
Logs and traces are definitely helpful by themselves, but when tied together via log-trace correlation they become a powerful tool: they expedite troubleshooting of issues in a production environment while giving operations and development teams insight into the health, performance, and behavior of a distributed system. As described above, there are several ways to implement the solution. Pick your poison :-)