Logging is arguably the most important element of your observability solution. Logs provide foundational and rich information about system behavior. In an ideal world, you would make all the decisions about logging and implement a consistent approach across your entire system. However, in the real world, you might work with legacy software or deal with different programming languages, frameworks, and open-source packages, each with its own format and structure for logging.

With such a diversity in log formats across your system, what steps can you take to extract the most value from all your logs? That’s what we’ll cover in this post.

We’ll look at how logs can be designed, the challenges and solutions to logging in large systems, and how to think about log-based metrics and long-term retention. Let’s dive in with a look at log levels and formats.

Logging Design

Many considerations go into log design, but the two most important aspects are the use of log levels and whether to use structured or unstructured log formats.

Log Levels

Log levels are used to categorize log messages based on their severity. Specific log levels used may vary depending on the logging framework or system. However, commonly used log levels include (in order of verbosity, from highest to lowest):

TRACE: Captures every action the system takes, for reconstructing a comprehensive record and accounting for any state change.
DEBUG: Captures detailed information for debugging purposes. These messages are typically only relevant during development and should not be enabled in production environments.
INFO: Provides general information about the system's operation to convey important events or milestones in the system's execution.
WARNING: Indicates potential issues or situations that might require attention. These messages are not critical but should be noted and investigated if necessary.
ERROR: Indicates errors that occurred during the execution of the system. These messages typically highlight issues that need to be addressed and might impact the system's functionality.

Logging at the appropriate level helps with understanding the system's behavior, identifying issues, and troubleshooting problems effectively.

When it comes to system components that you build, we recommend that you devote some time to defining the set of log levels that are useful. Understand what kinds of information should be included in messages at each log level, and use the log levels consistently.

Later, we’ll discuss how to deal with third-party applications, where you have no control over the log levels. We’ll also look at legacy applications that you control but are too expansive to migrate to the standard log levels.

Structured Versus Unstructured Logs

Entries in structured logs have a well-defined format, usually as key-value pairs or JSON objects. This allows for consistent and machine-readable log entries, making it easier to parse and analyze log data programmatically. Structured logging enables advanced log querying and analysis, making it particularly useful in large-scale systems.

On the other hand, unstructured (free-form) logging captures messages in a more human-readable format, without a predefined structure. This approach allows developers to log messages more naturally and flexibly. However, programmatically extracting specific information from the resulting logs can be very challenging.

Choosing between structured and unstructured logs depends on your specific needs and the requirements and constraints of your system. If you anticipate the need for advanced log analysis or integration with log analysis tools, structured logs can provide significant benefits. However, if all you need is simplicity and readability, then unstructured logs may be sufficient.

In some cases, a hybrid approach can also be used, where you use structured logs for important events and unstructured logs for more general messages.
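
To make the structured option concrete, here is a minimal sketch using only Python's standard logging and json modules. The field names (timestamp, level, component, message), the "fields" convention for extra context, and the sample order data are assumptions chosen for illustration, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass extra key-value context in `fields`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines at INFO level and above."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# Write to an in-memory buffer so the example is easy to inspect.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info("order placed", extra={"fields": {"order_id": "A-1001", "total_cents": 2599}})
log.debug("cart contents dumped")  # below INFO, so it is filtered out

structured_entry = json.loads(buffer.getvalue().splitlines()[0])
```

Because every entry is a JSON object, a query such as "all entries where order_id is A-1001" becomes a field lookup rather than a text search.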

For large-scale systems, you should lean towards structured logging when possible but note that this adds another dimension to your planning. The expectation for structured log messages is that the same set of fields will be used consistently across system components. This will require strategic planning.

Logging Challenges

With systems comprising multiple components, each component will most likely have its own model to manage its logs. Let’s review the challenges this brings.

Disparate Destinations

Components will log to different destinations: files, system logs, stdout, or stderr. In distributed systems, collecting these scattered logs for effective use is cumbersome. For this, you’ll need a diversified approach to log collection, such as using installed collectors and hosted collectors from Sumo Logic.

Varying Formats

Some components will use unstructured, free-form logging, not following any format in particular. Meanwhile, structured logs may be more organized, but components with structured logs might employ completely different sets of fields. Unifying the information you get from a diversity of logs and formats requires the right tools.

Inconsistent Log Levels

Components in your system might use different ranges of log levels. Even if you consolidate all log messages into a centralized logging system (as you should), you will need to deal with the union of all log levels.

One challenge that arises is when different log levels ought to be treated the same. For example, ERROR in one component might be the same as CRITICAL in another component, requiring immediate escalation.

You face the opposite challenge when the same log level in different components means different things. For example, INFO messages in one component may be essential for understanding the system state, while in another component they might be too verbose.

Log Storage Cost

Large distributed systems accumulate a lot of logs. Collecting and storing these logs isn’t cheap.
Log-related costs in the cloud can make up a significant portion of the total cost of the system.

Dealing With These Challenges

While the challenges of logging in large, distributed systems are significant, solutions can be found through some of the following practices.

Aggregate Your Logs

When you run a distributed system, you should use a centralized logging solution. As you run log collection agents on each machine in your system, these collectors will send all the logs to your central observability platform. Sumo Logic, which has always focused on log management and analytics, is best in class when it comes to log aggregation.

Move Toward a Unified Format

Dealing with logs in different formats is a big problem if you want to correlate log data for analytics and troubleshooting across applications and components. One solution is to transform different logs into a unified format. The level of effort for this task can be high, so consider doing this in phases, starting with your most essential components and working your way down.

Establish a Logging Standard Across Your Applications

For your own applications, work to establish a standard logging approach that adopts a uniform set of log levels, a single structured log format, and consistent semantics. If you also have legacy applications, evaluate the level of risk and cost associated with migrating them to adhere to your standard. If a migration is not feasible, treat your legacy applications like you would third-party applications.

Enrich Logs From Third-Party Sources

Enriching logs from third-party sources involves enhancing log data with contextual information from external systems or services. This brings a better understanding of log events, aiding in troubleshooting, analysis, and monitoring activities. To enrich your logs, you can integrate external systems (such as APIs or message queues) to fetch supplementary data related to log events (such as user information, customer details, or system metrics).
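
The unified-format and enrichment practices above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the level mapping, the free-form line layout, the unified field names, the in-memory user directory standing in for an external system, and the sample log lines.

```python
import json
import re

# Assumed mapping from component-specific level names to one shared scale.
LEVEL_MAP = {
    "CRITICAL": "ERROR",
    "FATAL": "ERROR",
    "ERR": "ERROR",
    "WARN": "WARNING",
    "NOTICE": "INFO",
}

# Assumed free-form layout: "<timestamp> [<level>] <message>"
UNSTRUCTURED = re.compile(r"^(?P<timestamp>\S+) \[(?P<level>[A-Za-z]+)\] (?P<message>.*)$")

# Stand-in for an external lookup (API, message queue, database).
USER_DIRECTORY = {"u-42": {"customer_tier": "enterprise"}}


def parse_line(raw_line: str) -> dict:
    """Return the fields found in a JSON or free-form log line."""
    try:
        parsed = json.loads(raw_line)
    except ValueError:
        parsed = None
    if isinstance(parsed, dict):
        return parsed
    match = UNSTRUCTURED.match(raw_line)
    if match:
        return match.groupdict()
    return {"message": raw_line}  # keep unparseable lines rather than drop them


def normalize(raw_line: str, component: str) -> dict:
    """Convert one raw log line into the unified schema."""
    fields = parse_line(raw_line)
    raw_level = fields.pop("level", None) or fields.pop("severity", None) or "UNKNOWN"
    level = str(raw_level).upper()
    message = fields.pop("message", None) or fields.pop("msg", "")
    return {
        "timestamp": fields.pop("timestamp", None),
        "level": LEVEL_MAP.get(level, level),
        "component": component,
        "message": message,
        "context": fields,  # whatever else the component logged
    }


def enrich(entry: dict, directory: dict) -> dict:
    """Attach supplementary data keyed by the entry's user_id, if any."""
    extra = directory.get(entry["context"].get("user_id"), {})
    return {**entry, "context": {**entry["context"], **extra}}


billing_entry = normalize(
    '{"timestamp": "2024-01-01T00:00:00Z", "severity": "critical", '
    '"msg": "db unreachable", "user_id": "u-42"}',
    "billing",
)
storage_entry = normalize("2024-01-01T00:00:05Z [warn] disk at 91%", "storage")
enriched_entry = enrich(billing_entry, USER_DIRECTORY)
```

Mapping CRITICAL onto ERROR is one way to handle levels that ought to be treated the same. The opposite case, where the same level means different things in different components, would need per-component rules, which this sketch does not attempt.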

Manage Log Volume, Frequency, and Retention

Carefully managing log volume, frequency, and retention is crucial for efficient log management and storage.

Volume: Monitoring generated log volume helps you control resource consumption and performance impacts.
Frequency: Determine how often to log, based on the criticality of events and desired level of monitoring.
Retention: Define a log retention policy appropriate for compliance requirements, operational needs, and available storage.
Rotation: Periodically archive or purge older log files to manage log file sizes effectively.
Compression: Compress log files to reduce storage requirements.

Log-Based Metrics

Metrics that are derived from analyzing log data can provide insights into system behavior and performance. Working with log-based metrics has its benefits and challenges.

Benefits

Granular insights: Log-based metrics provide detailed and granular insights into system events, allowing you to identify patterns, anomalies, and potential issues.
Comprehensive monitoring: By leveraging log-based metrics, you can monitor your system comprehensively, gaining visibility into critical metrics related to availability, performance, and user experience.
Historical analysis: Log-based metrics provide historical data that can be used for trend analysis, capacity planning, and performance optimization. By examining log trends over time, you can make data-driven decisions to improve efficiency and scalability.
Flexibility and customization: You can tailor your extraction of log-based metrics to suit your application or system, focusing on the events and data points that are most meaningful for your needs.

Challenges

Defining meaningful metrics: Because the set of metrics available to you across all your components is incredibly vast (and it wouldn’t make sense to capture them all), identifying which metrics to capture and extract from logs can be a complex task. This identification requires a deep understanding of system behavior and close alignment with your business objectives.
Data extraction and parsing: Parsing logs to extract useful metrics may require specialized tools or custom parsers. This is especially true if logs are unstructured or formatted inconsistently from one component to the next. Setting this up can be time-consuming and may require maintenance as log formats change or new log sources emerge.
Need for real-time analysis: Delays in processing log-based metrics can lead to outdated or irrelevant metrics. For most situations, you will need a platform that can perform fast, real-time processing of incoming data in order to leverage log-based metrics effectively.
Performance impact: Continuously capturing component profiling metrics places additional strain on system resources. You will need to find a good balance between capturing sufficient log-based metrics and maintaining adequate system performance.
Data noise and irrelevance: Log data often includes a lot of noise and irrelevant information, not contributing toward meaningful metrics. Careful log filtering and normalization are necessary to focus data gathering on relevant events.

Long-Term Log Retention

After you’ve made the move toward log aggregation in a centralized system, you will still need to consider long-term log retention policies. Let’s cover the critical questions for this area.

How Long Should You Keep Logs Around?

How long you should keep a log around depends on several factors, including:

Log type: Some logs (such as access logs) can be deleted after a short time. Other logs (such as error logs) may need to be kept for a longer time in case they are needed for troubleshooting.
Regulatory requirements: Industries like healthcare and finance have regulations that require organizations to keep logs for a certain time, sometimes even a few years.
Company policy: Your company may have policies that dictate how long logs should be kept.
Log size: If your logs are large, you may need to rotate them or delete them more frequently.
Storage cost: Regardless of where you store your logs (on-premises or in the cloud), you will need to factor in the cost of storage.

How Do You Reduce the Level of Detail and Cost of Older Logs?

Deleting old logs is, of course, the simplest way to reduce your storage costs. However, it may be a bit heavy-handed, and you sometimes may want to keep information from old logs around. When you want to keep information from old logs, but also want to be cost-efficient, consider taking some of these measures:

Downsampling logs: In the case of components that generate many repetitive log statements, you might ingest only a subset of the statements (for example, 1 out of every 10).
Trimming logs: For logs with large messages, you might discard some fields. For example, if an error log has an error code and an error description, you might have all the information you need by keeping only the error code.
Compression and archiving: You can compress old logs and move them to cheaper and less accessible storage (especially in the cloud). This is a great solution for logs that you need to store for years to meet regulatory compliance requirements.

Conclusion

In this article, we’ve looked at how to get the most out of logging in large-scale systems. Although logging in these systems presents a unique set of challenges, we’ve looked at potential solutions to these challenges, such as log aggregation, transforming logs to a unified format, and enriching logs with data from third-party sources.

Logging is a critical part of observability. By following the practices outlined in this article, you can ensure that your logs are managed effectively, enabling you to troubleshoot problems, identify issues, and gain insights into the behavior of your system.
And you can do this while keeping your logging costs at bay.
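
As a closing illustration of the downsampling and trimming measures described under long-term retention, here is a minimal Python sketch. The field names, the 1-in-10 ratio, and the sample entries are assumptions for illustration only. Uniform downsampling like this suits repetitive statements and could drop rare events, so it is not meant as a general policy.

```python
from typing import Iterable, Iterator

# Assumed set of fields worth keeping for old error logs.
KEEP_FIELDS = ("timestamp", "level", "error_code")


def downsample(entries: Iterable[dict], keep_every: int = 10) -> Iterator[dict]:
    """Yield 1 out of every `keep_every` entries, starting with the first."""
    for index, entry in enumerate(entries):
        if index % keep_every == 0:
            yield entry


def trim(entry: dict, keep_fields: tuple = KEEP_FIELDS) -> dict:
    """Drop every field that is not listed in `keep_fields`."""
    return {key: entry[key] for key in keep_fields if key in entry}


# Made-up sample data: 25 repetitive error entries.
old_logs = [
    {
        "timestamp": second,
        "level": "ERROR",
        "error_code": "E42",
        "error_description": "upstream service timed out while fetching profile",
    }
    for second in range(25)
]

archived = [trim(entry) for entry in downsample(old_logs, keep_every=10)]
```

With 25 input entries and keep_every=10, the entries at positions 0, 10, and 20 survive, each reduced to its timestamp, level, and error code.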

How to Extract the Maximum Value From Logs
