Three pillars of observability

Logs, metrics, and traces are commonly known as the three pillars of observability. While there are many other data sources that can contribute to better observability for your system, these three provide a great starting point for understanding the state of your system.

Logs

Logs are records of events or messages in a system. Each log entry generally has a timestamp associated with its message, and logs can come in various forms, including plaintext, structured formats (e.g., JSON), or binary formats (e.g., Protobuf).

Logs capture raw events that can later be used to reconstruct the flow of execution or the state of the system. Examples include HTTP requests, access information, and user activity.

Logs often have different levels of severity such as debug, info, or error to help categorize them.
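As a rough illustration of structured logs with timestamps and severity levels, here is a minimal sketch using Python's standard logging module. The logger name "checkout-service", the field names, and the messages are hypothetical, chosen only for the example.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            # record.created is the event time as a Unix timestamp.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout-service")  # hypothetical service name
    logger.setLevel(logging.INFO)  # entries below INFO are dropped
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def demo_logs():
    """Emit three entries at different levels and parse what was written."""
    stream = io.StringIO()
    logger = build_logger(stream)
    logger.debug("cache miss for cart 42")       # filtered out: below INFO
    logger.info("GET /cart/42 returned 200")
    logger.error("payment provider timed out")
    return [json.loads(line) for line in stream.getvalue().splitlines()]
```

Calling demo_logs() returns the two entries that passed the level filter, each with a timestamp, a level, and a message. Because every entry is a JSON object, downstream tools can filter on fields such as level without parsing free-form text.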

Observability platforms often aggregate logs from multiple sources in a centralized manner to help surface problems.
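The idea of centralized aggregation can be sketched in a few lines. The example below assumes each service produces entries already sorted by timestamp, and that timestamps are UTC strings in one fixed format so they sort lexicographically. The service names and entries are made up for illustration, and real platforms do far more (shipping, indexing, retention).

```python
import heapq


def aggregate(*sources):
    """Merge per-service log streams (each time-sorted) into one timeline."""
    return list(heapq.merge(*sources, key=lambda entry: entry["timestamp"]))


def errors_by_service(entries):
    """Count ERROR entries per service to surface where problems cluster."""
    counts = {}
    for entry in entries:
        if entry["level"] == "ERROR":
            counts[entry["service"]] = counts.get(entry["service"], 0) + 1
    return counts


def demo_aggregate():
    # Hypothetical entries from two services.
    web = [
        {"timestamp": "2024-01-01T00:00:01Z", "service": "web",
         "level": "INFO", "message": "GET /orders"},
        {"timestamp": "2024-01-01T00:00:04Z", "service": "web",
         "level": "ERROR", "message": "upstream returned 500"},
    ]
    db = [
        {"timestamp": "2024-01-01T00:00:02Z", "service": "db",
         "level": "ERROR", "message": "connection pool exhausted"},
        {"timestamp": "2024-01-01T00:00:03Z", "service": "db",
         "level": "ERROR", "message": "query timed out"},
    ]
    timeline = aggregate(web, db)
    return timeline, errors_by_service(timeline)
```

In the merged timeline, the two database errors appear just before the web error, which hints at where the problem started. That ordering is hard to see when each service's logs are read in isolation.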

Metrics

Metrics are quantitative data points that measure various aspects of a system.

Common metric types include counters, gauges, and histograms. Each data point is typically associated with a timestamp and one or more labels in addition to its value.
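To make the three metric types concrete, here is a hand-rolled sketch. It does not use or mirror any particular metrics library, and the class names, bucket bounds, and latency values are assumptions for illustration. A counter only goes up, a gauge can move in either direction, and a histogram counts observations per bucket.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    value: float = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


@dataclass
class Gauge:
    """Value that can go up or down, e.g. requests currently in flight."""
    value: float = 0.0

    def set(self, value):
        self.value = value


@dataclass
class Histogram:
    """Counts observations per bucket; bounds are sorted upper limits."""
    bounds: list
    counts: list = field(default_factory=list)
    total: float = 0.0

    def __post_init__(self):
        # One bucket per bound, plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value):
        # bisect_left finds the first bound >= value, i.e. its bucket.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value


def demo_metrics():
    requests_total = Counter()
    in_flight = Gauge()
    latency_seconds = Histogram(bounds=[0.1, 0.5, 1.0])
    for latency in (0.05, 0.3, 0.3, 2.0):  # hypothetical request latencies
        in_flight.set(in_flight.value + 1)
        requests_total.inc()
        latency_seconds.observe(latency)
        in_flight.set(in_flight.value - 1)
    return requests_total.value, in_flight.value, latency_seconds.counts
```

Here demo_metrics() returns (4.0, 0.0, [1, 2, 0, 1]): four requests counted, none still in flight, and one latency falling into the overflow bucket above 1.0 seconds. A real system would also attach a timestamp and labels (for example, the endpoint or host) to each data point.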

Examples of metrics include CPU/memory usage, API response times, error rates, or resource utilization percentages.

Metrics are useful for assessing the performance and health of a system. It is important to collect metrics not only from your application but also from the infrastructure that hosts it (e.g., AWS, GCP, Azure).

Traces

Traces are pieces of data that record the flow of requests through a system.

Traces are made up of spans, each of which records a single operation within a particular service along with its start and end timestamps. Together, the spans of a trace record the entire lifecycle of a request as it flows from service to service.
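The relationship between traces and spans can be sketched as a small data structure. This is a simplified, hand-rolled model, not the API of any tracing library. The Span fields, the Tracer class, and the service and operation names are all assumptions made for the example. Each span carries the shared trace ID plus the ID of its parent span, which is what lets a backend reassemble the request's path.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this operation
    parent_id: Optional[str]  # None for the root span
    service: str
    name: str
    start: float
    end: Optional[float] = None


class Tracer:
    """Collects spans in memory; a real tracer would export them."""

    def __init__(self):
        self.spans = []

    def start_span(self, service, name, parent=None):
        span = Span(
            # A child inherits the trace ID; a root span starts a new trace.
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            service=service,
            name=name,
            start=time.monotonic(),
        )
        self.spans.append(span)
        return span

    def end_span(self, span):
        span.end = time.monotonic()


def demo_trace():
    """Simulate one request passing through three hypothetical services."""
    tracer = Tracer()
    root = tracer.start_span("api-gateway", "POST /checkout")
    order = tracer.start_span("order-service", "create_order", parent=root)
    payment = tracer.start_span("payment-service", "charge_card", parent=order)
    tracer.end_span(payment)
    tracer.end_span(order)
    tracer.end_span(root)
    return tracer.spans
```

All three spans returned by demo_trace() share one trace ID, and following the parent IDs from the payment span back to the root reconstructs the call chain. In a real deployment the trace and parent IDs would be propagated between services, typically in request headers, rather than passed as in-process objects.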

While logs and traces record similar types of information, it’s important to note the difference in scope: a log entry describes an event within a single service, whereas a trace ties together the work done for one request across multiple services.

In distributed systems, tracing is especially important because a single request may pass through many microservices. In such architectures, it’s often difficult to debug an issue or pinpoint the root cause of a performance degradation without tracing.