An Engineer’s Checklist of Logging Best Practices

https://www.honeycomb.io/blog/engineers-checklist-logging-best-practices

The best DevOps and SRE teams have shifted their approach to monitoring and logging their systems. These teams debug problems methodically and confidently, regardless of the system's complexity. Gone are the days of a slew of logs that fail to explain the cause of alerts, system failures, and other unknowns.

Implementing logging best practices is foundational for maintaining system integrity and performance in today’s complex IT environments. Effective logging not only streamlines troubleshooting by providing clear insights into errors and system behavior, but it also enhances performance monitoring by enabling the identification of bottlenecks and anomalies. Robust logging is also crucial to security, helping to detect and investigate potential threats or unauthorized access. 

Following this checklist of logging best practices will make your logging efficient, actionable, and scalable. 

Why is logging important?

Logs are records of events, actions, and messages generated by an application or underlying infrastructure system during operation. Log messages contain a wealth of useful information about a system, including a description of the event that occurred, a timestamp, a severity level, and other relevant metadata. Logs are valuable for debugging issues, diagnosing errors, and auditing system activity. They provide a textual narrative of system events, making it easier to understand the sequence of actions taken.

10 logging best practices

1. Structure your logs

Unstructured logs can be difficult to parse, search, and analyze. Use structured logs (e.g., JSON or XML) to process logs programmatically, correlate them with other logs, and use them in monitoring and analysis tools. Structured logs are more readable and actionable, allowing developers and operations teams to identify key information quickly.

Example: log.json file

{
  "timestamp": "2024-09-18T12:00:00Z",
  "level": "INFO",
  "message": "User login successful",
  "user_id": "123456",
  "session_id": "abcde12345"
}

Using this practice, you can begin treating your logs as structured events wherever possible.  
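
In application code, emitting structured logs can be as simple as the following sketch, which uses Python's standard logging module with a hand-rolled JSON formatter. Dedicated libraries such as structlog or python-json-logger do the same with less ceremony; the "fields" convention here is our own, not part of the logging API.

import json
import logging
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via logging's `extra` argument.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful",
            extra={"fields": {"user_id": "123456", "session_id": "abcde12345"}})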

2. Consolidate your logs at creation time

Consolidating multiple related log entries into a single coherent event can reduce log volume, improve clarity, and streamline log analysis. Rather than logging each step of a process individually, aggregate the status of each action, timing information, and any other key fields into one structured log.

For example, if you’re processing a user login, you can include whether credentials were verified, how long the process took, and the outcome all in one log. 

Multiple log entries:

"time": "2024-09-18T12:00:00Z", "level": "INFO", "message": "User authentication started", "user_id": "123456"
"time": "2024-09-18T12:00:01Z", "level": "DEBUG", "message": "Checking user credentials", "user_id": "123456"
"time": "2024-09-18T12:00:02Z", "level": "INFO", "message": "User credentials verified", "user_id": "123456"
"time": "2024-09-18T12:00:03Z", "level": "INFO", "message": "Generating session token", "user_id": "123456"
"time": "2024-09-18T12:00:04Z", "level": "INFO", "message": "Session token generated", "user_id": "123456", "session_id": "abcde12345"
"time": "2024-09-18T12:00:05Z", "level": "INFO", "message": "User login successful", "user_id": "123456", "session_id": "abcde12345"

A single log entry:

"time": "2024-09-18T12:00:00Z", "duration_ms": "5000", "message": "User login authenticated", “user.credentials.verified”: true, "request_id": "req-789xyz", "user_id": "123456", "session_id": "abcde12345"

Logged by the service, this is sometimes called a canonical log because it fully represents one request. This example is also called a wide event because it describes one significant event with many fields.

If it’s difficult to accumulate all the information into one call to the logger, consider creating a trace span instead. Information can be added to the span throughout the unit of work. 
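
A rough sketch of the pattern, assuming the logger from the earlier example. The helper functions (verify_credentials, create_session, new_request_id) are placeholders for whatever your service actually does; the point is that each step adds fields instead of emitting its own log line.

import time

def handle_login(user_id, credentials):
    # Accumulate fields as the request progresses.
    canonical = {"request_id": new_request_id(), "user_id": user_id}
    start = time.monotonic()
    canonical["user.credentials.verified"] = verify_credentials(credentials)
    canonical["session_id"] = create_session(user_id)
    canonical["duration_ms"] = int((time.monotonic() - start) * 1000)
    # One wide, canonical event per request, emitted once at the end.
    logger.info("User login authenticated", extra={"fields": canonical})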

3. Use unique identifiers

Generate a unique identifier when a request arrives from outside your software system, and include it in all processing caused by that request.

Ideally, each service in your system outputs one canonical log, linked to the others by a unique identifier such as a request ID or trace ID field. These identifiers help debug complex problems faster and enable tracking of specific actions, requests, or users across systems and services.
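
For illustration, a small sketch of the boundary logic: reuse the caller's ID when one was sent, otherwise mint one, and attach it to every log call made on behalf of that request. The X-Request-ID header is a common convention rather than a standard.

import uuid

def request_id_from(headers):
    """Reuse the caller's ID when present; otherwise mint one at the boundary."""
    return headers.get("X-Request-ID") or str(uuid.uuid4())

# Pass the ID to every log call (and outbound request) for this unit of work.
request_id = request_id_from({})
logger.info("Order received", extra={"fields": {"request_id": request_id}})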

4. Standardize log field names and types on your structured logs 

Convert your logs to the standard OpenTelemetry model. Having standard field names and types across your services makes it easier to search, analyze, and correlate logs. Without a consistent format, logs can become fragmented, leading to slower issue detection and increased complexity.
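
As a sketch of what that standardization can look like, here is an ad-hoc entry next to the same entry expressed in the shape of the OpenTelemetry log data model. The attribute names follow OpenTelemetry's semantic conventions as of this writing; check the current spec before adopting them.

Ad-hoc names:

{"time": "2024-09-18T12:00:00Z", "lvl": "info", "msg": "login ok", "uid": "123456", "path": "/login"}

OpenTelemetry-shaped:

{
  "Timestamp": "2024-09-18T12:00:00Z",
  "SeverityText": "INFO",
  "Body": "User login successful",
  "Attributes": {
    "user.id": "123456",
    "url.path": "/login"
  }
}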

5. Avoid logging sensitive data

Logs should never contain sensitive information such as passwords, credit card details, or personally identifiable information (PII). Logging sensitive data can lead to security vulnerabilities or compliance violations. Ensure that sensitive information is masked, excluded from logs altogether, or managed properly with a centralized logging management system.

Example:

Before masking:

Password: 12345678

After masking:

Password: *****
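
One way to enforce this in code is a redaction filter that runs before any record is emitted, sketched here with Python's logging filters and the "fields" convention from the earlier example. The list of sensitive keys is illustrative; in practice it should come from your data handling policy.

import logging

SENSITIVE_KEYS = {"password", "credit_card", "ssn"}

class MaskSensitiveData(logging.Filter):
    """Replace values of known-sensitive fields before the record is emitted."""
    def filter(self, record):
        fields = getattr(record, "fields", None)
        if fields:
            for key in fields:
                if key in SENSITIVE_KEYS:
                    fields[key] = "*****"
        return True  # never drop the record, only redact it

logger.addFilter(MaskSensitiveData())
logger.info("Password reset requested",
            extra={"fields": {"user_id": "123456", "password": "12345678"}})
# Emitted as: ... "user_id": "123456", "password": "*****" ...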

6. Treat your logs as data

Without effective log analysis, even well-structured logs can become overwhelming and unmanageable, making it difficult to identify patterns, detect anomalies, or track down the root cause of problems. Filter your logs: using a combination of fields like request IDs, user IDs, URL paths, or error codes, you can narrow the logs down to just the relevant entries, speeding up the troubleshooting process and ensuring you're focusing on actionable insights.

Counting logs that represent user actions can provide application metrics, like how many times an API was called and how often it succeeded. If logs include request durations, then aggregating these provides detailed latency statistics. These application metrics are even more valuable than the usual time-series aggregations, because they’re backed by the detailed log entries that you need to debug the errors and troubleshoot increased latency.
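
As a toy example of treating logs as data, the sketch below computes an error count and latency percentiles from a file of JSON log lines. The file name and field names are assumptions carried over from the earlier examples.

import json

def percentile(sorted_vals, pct):
    """Nearest-rank percentile of an already-sorted list."""
    if not sorted_vals:
        return None
    idx = min(int(len(sorted_vals) * pct / 100), len(sorted_vals) - 1)
    return sorted_vals[idx]

with open("app.log") as f:
    events = [json.loads(line) for line in f if line.strip()]

error_count = sum(1 for e in events if e.get("level") == "ERROR")
durations = sorted(e["duration_ms"] for e in events if "duration_ms" in e)

print("errors:", error_count)
print("p50 latency:", percentile(durations, 50), "ms")
print("p95 latency:", percentile(durations, 95), "ms")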

7. Use a centralized logging management system

Logs are often scattered across different services, servers, and regions in distributed systems, making log consolidation and management difficult. Use a centralized logging management system like Elasticsearch, Splunk, or Honeycomb (e.g., send log data using OpenTelemetry) to collect, aggregate, and analyze logs in one place. This enables faster searches, log analysis, and better cross-service correlation. 

A centralized logging system gives developers and engineers many benefits, including the flexibility of detailed logs for immediate troubleshooting across a distributed system and consolidated events for long-term storage or audits.
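
As one possible setup, here is a minimal OpenTelemetry Collector pipeline that tails application log files and forwards them to Honeycomb over OTLP. The filelog receiver ships with the Collector's contrib distribution; the endpoint and header shown follow Honeycomb's documented OTLP ingest, but verify them against the current docs.

receivers:
  filelog:
    include: [ /var/log/myapp/*.log ]

exporters:
  otlp:
    endpoint: "api.honeycomb.io:443"
    headers:
      # API key supplied via environment variable, not hard-coded.
      "x-honeycomb-team": "${env:HONEYCOMB_API_KEY}"

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]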

8. Configure log retention 

Logs are useful for troubleshooting, audits, and compliance—but shouldn’t be kept indefinitely. Define retention policies to automatically archive or delete old logs after a certain period. This reduces storage costs and ensures compliance with data protection regulations like GDPR, which may require deleting logs after a certain time.
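
What this looks like in practice depends on where your logs land. As one hypothetical example, an S3 lifecycle rule that moves logs to cold storage after 30 days and deletes them after 90:

{
  "Rules": [
    {
      "ID": "log-retention",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [{ "Days": 30, "StorageClass": "GLACIER" }],
      "Expiration": { "Days": 90 }
    }
  ]
}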

9. Set up alerts

Logs aren’t just for historical reference; they can trigger real-time alerts for critical issues. Set up alerts for ERROR or FATAL level logs or specific conditions, like repeated login failures, high memory usage, or missing services. A robust strategy for log analysis should include alerts so that your team can respond to incidents quickly before they escalate into bigger problems.
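
Most teams define alerts in their logging or monitoring platform, but the underlying logic is simple. Here is a sketch of a "repeated login failures" condition evaluated over structured log events; the thresholds, the unix_time field, and the send_alert helper are all illustrative assumptions.

from collections import deque

WINDOW_SECONDS = 60   # look-back window
THRESHOLD = 5         # failures that trigger the alert
recent_failures = deque()

def on_log_event(event):
    """Track login failures and alert when too many land inside the window."""
    if event.get("message") != "User login failed":
        return
    now = event["unix_time"]
    recent_failures.append(now)
    # Evict failures that have aged out of the window.
    while recent_failures and now - recent_failures[0] > WINDOW_SECONDS:
        recent_failures.popleft()
    if len(recent_failures) >= THRESHOLD:
        send_alert(f"{len(recent_failures)} login failures in the last {WINDOW_SECONDS}s")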

10. Document log formats and practices

Ensure your log formats, logging practices, and policies are well-documented. Developers, DevOps teams, and other stakeholders should know how to generate and interpret logs. Proper documentation provides clarity and ensures that everyone follows the same guidelines, especially as teams grow or onboard new members.

Key sections to include in your documentation:

  • Log format specification (fields, data types; see the example below)
  • Log level definitions
  • Retention policies
  • Sensitive data handling guidelines
  • Logging tools and systems used
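
For instance, a log format specification entry might look like this (an illustrative template, not a standard):

Field        Type    Required  Description
timestamp    string  yes       ISO 8601 event time, in UTC
level        string  yes       One of DEBUG, INFO, WARN, ERROR, FATAL
message      string  yes       Human-readable summary of the event
request_id   string  yes       Unique ID propagated from the system boundary
user_id      string  no        Authenticated user, when one exists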

How Honeycomb makes logging easier

This checklist shared logging best practices for developers and engineers seeking to improve and modernize their logging strategy. However, log analysis and log monitoring can be slow and complex to navigate, leading to inefficient troubleshooting—even when using best practices. Developers and engineers must use observability practices to understand a system’s behaviors, flaws, and performance holistically.  

Reading logs doesn't work at scale. The more logs there are, the harder it is to find the one that matters. This is where log analysis comes in: counting, filtering, and graphing the contents of logs. Honeycomb then lets you see what's happening at a high level: how many of each error message occurred, the full distribution of latencies, and which user IDs are seeing the most problems.

Honeycomb can also tell you what distinguishes error events from successful ones. Some days the failures cluster in one region; other days they hit a single customer or product. Honeycomb gets you value from every field on every structured log.

Honeycomb offers an intuitive approach to observability. You can send application and infrastructure logs to Honeycomb, capturing context-rich data used to analyze system interactions and behaviors. Honeycomb has future-proofed enterprises and corporations worldwide with rich visualizations, real-time insights, and powerful querying capabilities. Fender’s Journey to Modern Observability shares how they used their logs and Honeycomb to improve their systems and understanding of events.  

Conclusion

Good logging practices are essential for system observability, troubleshooting, and maintaining performance at scale. Without proper practices in place, logs can become noisy, unhelpful, or even a liability. In this blog, we shared ten best practices to maximize your logging efforts and answer system questions related to performance, security, and behavior. 

By structuring your logs, standardizing formats, using unique identifiers, and configuring log retention, you can ensure your logs remain actionable and efficient. Integrating centralized management, avoiding sensitive data, and documenting practices helps keep your system secure and scalable.

Engineers and developers can take the concepts shared here to build the structured events needed for production excellence. 

New to Honeycomb? Get your free Honeycomb account today.
