Logging is arguably the most important element of your observability solution. Logs provide foundational and rich information about system behavior. In an ideal world, you would make all the decisions about logging and implement a consistent approach across your entire system.
However, in the real world, you might work with legacy software or deal with different programming languages, frameworks, and open-source packages, each with its own format and structure for logging.
With such diversity in log formats across your system, what steps can you take to extract the most value from all your logs? That’s what we’ll cover in this post.
We’ll look at how logs can be designed, the challenges and solutions to logging in large systems, and how to think about log-based metrics and long-term retention.
Let’s dive in with a look at log levels and formats.
Many considerations go into log design, but the two most important aspects are the use of log levels and whether to use structured or unstructured log formats.
Log levels categorize log messages by severity. The specific levels vary by logging framework or system, but commonly used ones include (from most to least verbose): TRACE, DEBUG, INFO, WARN, ERROR, and FATAL.
Logging at the appropriate level helps with understanding the system's behavior, identifying issues, and troubleshooting problems effectively.
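As a minimal sketch of how a level threshold works in practice, the example below uses Python's standard `logging` module. The component name `checkout` and the messages are made up for illustration. The logger is set to WARNING, so the more verbose DEBUG and INFO messages are dropped.

```python
import io
import logging

# Capture output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("checkout")  # hypothetical component name
logger.setLevel(logging.WARNING)        # drop anything less severe than WARNING
logger.addHandler(handler)
logger.propagate = False                # keep output out of the root logger

logger.debug("cart contents: %s", ["sku-1", "sku-2"])  # suppressed
logger.info("order received")                          # suppressed
logger.warning("payment retry %d of %d", 2, 3)         # emitted
logger.error("payment provider unreachable")           # emitted

emitted = stream.getvalue().splitlines()
print(emitted)
```

Raising or lowering the threshold per environment (for example, DEBUG in development and WARNING in production) is one common way to control verbosity without touching the call sites.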
When it comes to system components that you build, we recommend that you devote some time to defining the set of log levels that are useful. Understand what kinds of information should be included in messages at each log level, and use the log levels consistently.
Later, we’ll discuss how to deal with third-party applications, where you have no control over the log levels. We’ll also look at legacy applications that you control but are too expansive to migrate to the standard log levels.
Entries in structured logs have a well-defined format, usually as key-value pairs or JSON objects. This allows for consistent and machine-readable log entries, making it easier to parse and analyze log data programmatically.
Structured logging enables advanced log querying and analysis, making it particularly useful in large-scale systems.
On the other hand, unstructured (free-form) logging captures messages in a more human-readable format, without a predefined structure. This approach allows developers to log messages more naturally and flexibly.
However, programmatically extracting specific information from the resulting logs can be very challenging.
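To make the contrast concrete, here is a sketch of structured logging in Python that emits each entry as a JSON object. The `JsonFormatter` class, the `fields` key, and the field names (`order_id`, `total_cents`) are assumptions chosen for this example, not part of any standard.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        }
        # Merge any extra key-value pairs passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")  # hypothetical component name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("order placed", extra={"fields": {"order_id": "A-100", "total_cents": 2599}})

structured_line = stream.getvalue().strip()
parsed = json.loads(structured_line)  # a machine can read the fields back directly
print(parsed["order_id"], parsed["total_cents"])
```

The equivalent free-form message ("order A-100 placed, total $25.99") is easier on the eyes, but recovering `order_id` from it would require a pattern written specifically for that message.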
Choosing between structured and unstructured logs depends on your specific needs and the requirements and constraints of your system. If you anticipate the need for advanced log analysis or integration with log analysis tools, structured logs can provide significant benefits.
However, if all you need is simplicity and readability, then unstructured logs may be sufficient.
In some cases, a hybrid approach can also be used, where you use structured logs for important events and unstructured logs for more general messages.
For large-scale systems, you should lean towards structured logging when possible, but note that this adds another dimension to your planning: structured log messages are only useful at scale when the same set of fields is used consistently across system components.
With systems comprising multiple components, each component will most likely have its own model to manage its logs. Let’s review the challenges this brings.
Components will log to different destinations—files, system logs, stdout, or stderr. In distributed systems, collecting these scattered logs for effective use is cumbersome.
For this, you’ll need a diversified approach to log collection, such as using installed collectors and hosted collectors from Sumo Logic.
Some components will use unstructured, free-form logging, not following any format in particular. Meanwhile, structured logs may be more organized, but components with structured logs might employ completely different sets of fields.
Unifying the information you get from a diversity of logs and formats requires the right tools.
Components in your system might use different ranges of log levels. Even if you consolidate all log messages into a centralized logging system (as you should), you will need to deal with the union of all log levels.
One challenge that arises is when different log levels ought to be treated the same. For example, ERROR in one component might be the same as CRITICAL in another component, requiring immediate escalation.
You face the opposite challenge when the same log level in different components means different things. For example, INFO messages in one component may be essential for understanding the system state, while in another component they might be too verbose.
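One way to handle both cases is a per-component mapping onto a shared severity scale, applied when logs are ingested. The sketch below assumes hypothetical component names (`payments`, `search`) and mapping rules; the real rules would come from knowing what each component's levels mean.

```python
# Hypothetical per-component rules for translating native levels to a shared scale.
LEVEL_MAP = {
    # ERROR from this component requires immediate escalation, so treat it as CRITICAL.
    "payments": {"ERROR": "CRITICAL"},
    # INFO from this component is too verbose, so demote it to DEBUG.
    "search": {"INFO": "DEBUG"},
}

def normalize_level(component, level):
    """Return the shared-scale level for a component's native level."""
    native = level.upper()
    return LEVEL_MAP.get(component, {}).get(native, native)

print(normalize_level("payments", "error"))  # CRITICAL
print(normalize_level("search", "INFO"))     # DEBUG
print(normalize_level("inventory", "warn"))  # WARN (no rule, passed through)
```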
Large distributed systems accumulate a lot of logs. Collecting and storing these logs isn’t cheap. Log-related costs in the cloud can make up a significant portion of the total cost of the system.
While the challenges of logging in large, distributed systems are significant, solutions can be found through some of the following practices.
When you run a distributed system, you should use a centralized logging solution. Log collection agents running on each machine in your system send all the logs to your central observability platform.
Sumo Logic, which has always focused on log management and analytics, is best in class when it comes to log aggregation.
Dealing with logs in different formats is a big problem if you want to correlate log data for analytics and troubleshooting across applications and components. One solution is to transform different logs into a unified format.
The level of effort for this task can be high, so consider doing this in phases, starting with your most essential components and working your way down.
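As a sketch of what such a transformation might look like, the example below converts two different inputs into one record shape: a free-form text line and a JSON line with its own field names. The target fields (`timestamp`, `level`, `component`, `message`), the legacy line layout, and the vendor keys (`ts`, `severity`, `msg`) are all assumptions for illustration.

```python
import json
import re

# Assumed legacy layout: "<timestamp> [<LEVEL>] <message>"
LEGACY_PATTERN = re.compile(r"^(?P<timestamp>\S+) \[(?P<level>[A-Z]+)\] (?P<message>.*)$")

def from_legacy_text(line, component):
    """Parse a free-form line into the unified record shape."""
    match = LEGACY_PATTERN.match(line)
    if match is None:
        # Keep unparseable lines rather than dropping them.
        return {"timestamp": None, "level": "UNKNOWN",
                "component": component, "message": line}
    return {**match.groupdict(), "component": component}

def from_vendor_json(line, component):
    """Rename a third-party JSON entry's fields to the unified names."""
    raw = json.loads(line)
    return {
        "timestamp": raw.get("ts"),
        "level": str(raw.get("severity", "UNKNOWN")).upper(),
        "component": component,
        "message": raw.get("msg", ""),
    }

legacy = from_legacy_text("2024-01-01T00:00:00Z [ERROR] disk full", "storage")
vendor = from_vendor_json(
    '{"ts": "2024-01-01T00:00:01Z", "severity": "warn", "msg": "slow query"}', "db"
)
print(legacy)
print(vendor)
```

Starting with one adapter per essential component, as suggested above, keeps each phase small and lets you verify the unified records before moving on.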
For your own applications, work to establish a standard logging approach that adopts a uniform set of log levels, a single structured log format, and consistent semantics.
If you also have legacy applications, evaluate the level of risk and cost associated with migrating them to adhere to your standard.
If a migration is not feasible, treat your legacy applications like you would third-party applications.
Enriching logs from third-party sources involves enhancing log data with contextual information from external systems or services. This brings a better understanding of log events, aiding in troubleshooting, analysis, and monitoring activities.
To enrich your logs, you can integrate external systems (such as APIs or message queues) to fetch supplementary data related to log events (such as user information, customer details, or system metrics).
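The sketch below illustrates the idea with an in-memory dictionary standing in for the external system. In a real setup, `lookup_user` would call an API or read from a cache; the function names, the `user_id` field, and the sample data are all hypothetical.

```python
# Stand-in for an external user service (hypothetical data).
USER_DIRECTORY = {"u-42": {"plan": "enterprise", "region": "eu-west"}}

def lookup_user(user_id):
    """Fetch supplementary user details; returns {} when nothing is known."""
    return USER_DIRECTORY.get(user_id, {})

def enrich(event, lookup=lookup_user):
    """Return a copy of the log event with user details added as user_* fields."""
    enriched = dict(event)
    for key, value in lookup(event.get("user_id")).items():
        # setdefault avoids overwriting fields the event already carries.
        enriched.setdefault(f"user_{key}", value)
    return enriched

event = {"level": "ERROR", "message": "login failed", "user_id": "u-42"}
enriched_event = enrich(event)
print(enriched_event)
```

Passing the lookup as a parameter keeps the enrichment step testable and makes it easy to swap in a cached or batched lookup if per-event calls become too slow.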
Carefully managing log volume, frequency, and retention is crucial for efficient log management and storage.
Metrics that are derived from analyzing log data can provide insights into system behavior and performance. Working with log-based metrics has its benefits and challenges.
Defining meaningful metrics: Because the set of metrics available to you across all your components is incredibly vast—and it wouldn’t make sense to capture them all—identifying which metrics to capture and extract from logs can be a complex task.
This identification requires a deep understanding of system behavior and close alignment with your business objectives.
Data extraction and parsing: Parsing logs to extract useful metrics may require specialized tools or custom parsers. This is especially true if logs are unstructured or formatted inconsistently from one component to the next.
Setting this up can be time-consuming and may require maintenance as log formats change or new log sources emerge.
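Once logs are in a consistent structured form, deriving a metric can be straightforward. As an illustrative sketch, the function below computes an error rate per component from a list of records. The field names (`component`, `level`) and the choice to count both ERROR and CRITICAL as errors are assumptions.

```python
from collections import Counter

ERROR_LEVELS = {"ERROR", "CRITICAL"}  # assumed definition of "an error"

def error_rate_by_component(records):
    """Return {component: fraction of its records at an error level}."""
    totals = Counter()
    errors = Counter()
    for record in records:
        component = record.get("component", "unknown")
        totals[component] += 1
        if record.get("level") in ERROR_LEVELS:
            errors[component] += 1
    return {component: errors[component] / totals[component] for component in totals}

records = [
    {"component": "api", "level": "INFO"},
    {"component": "api", "level": "ERROR"},
    {"component": "api", "level": "INFO"},
    {"component": "api", "level": "DEBUG"},
    {"component": "db", "level": "CRITICAL"},
    {"component": "db", "level": "ERROR"},
]
rates = error_rate_by_component(records)
print(rates)
```

With unstructured logs, the same metric would first require a parser for every message format, which is where most of the setup and maintenance cost comes from.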
After you’ve made the move toward log aggregation in a centralized system, you will still need to consider long-term log retention policies. Let’s cover the critical questions for this area.
How long you should keep a log around depends on several factors, including:
Deleting old logs is, of course, the simplest way to reduce your storage costs. However, it can be heavy-handed, since you may sometimes want to keep information from old logs around.
When you want to keep information from old logs, but also want to be cost-efficient, consider taking some of these measures:
In this article, we’ve looked at how to get the most out of logging in large-scale systems.
Although logging in these systems presents a unique set of challenges, there are practical ways to address them, such as log aggregation, transforming logs to a unified format, and enriching logs with data from third-party sources.
Logging is a critical part of observability. By following the practices outlined in this article, you can ensure that your logs are managed effectively, enabling you to troubleshoot problems, identify issues, and gain insights into the behavior of your system.
And you can do this while keeping your logging costs in check.