smhk

Python log storage

OpenTelemetry is a framework for collecting telemetry data (metrics, logs and traces) and forwarding it to storage and analysis tools.

Logs, Spans and Traces §

Logs, spans and traces are key components of OpenTelemetry:

  • Logs are timestamped messages, structured or unstructured. They are not very useful for tracking code execution, but historically have been relied upon heavily.
  • Spans track an operation, recording what happened during the time in which the operation was executed. They are the building blocks of traces.
  • Traces consist of one or more spans. All spans have a reference to their parent span, except for the root span. This allows for visualising traces as a waterfall diagram.
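To make the parent references concrete, here is a minimal sketch of how a trace can be rendered as a waterfall. It is a toy model using only the standard library, not the OpenTelemetry API: the `Span` class, its field names, `render_waterfall` and the example operations are all made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Span:
    """Toy span: a named operation with start/end times and an optional parent."""

    span_id: str
    name: str
    start_ms: int
    end_ms: int
    parent_id: str | None = None  # None marks the root span


def render_waterfall(spans: list[Span]) -> list[str]:
    """Return one line per span, indented by depth and ordered by start time."""
    # Group spans by the ID of their parent so children can be looked up quickly.
    children: dict[str | None, list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: str | None, depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            duration = span.end_ms - span.start_ms
            indent = "  " * depth
            lines.append(
                f"{indent}{span.name} [{span.start_ms}-{span.end_ms} ms, {duration} ms]"
            )
            walk(span.span_id, depth + 1)

    walk(None, 0)  # start from the root span, which has no parent
    return lines


# A hypothetical trace for one web request.
trace = [
    Span("a", "GET /orders", 0, 120),
    Span("b", "query database", 10, 70, parent_id="a"),
    Span("c", "render response", 75, 115, parent_id="a"),
    Span("d", "acquire connection", 12, 20, parent_id="b"),
]

waterfall = render_waterfall(trace)
print("\n".join(waterfall))
```

Each level of indentation corresponds to one parent reference, which is the information a visualization tool needs to draw the waterfall.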

OpenTelemetry is not a backend. It does not handle storage or visualization.

Collector §

You can send data directly from your application to the backend, but it is recommended to use a collector: a service that runs alongside your application. The advantage of a collector is that it takes care of concerns such as retries, batching, encryption and sensitive data filtering, so your application does not have to.
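As a rough illustration of three of those responsibilities (batching, retries and sensitive data filtering), here is a toy sketch in plain Python. It is not how the OpenTelemetry Collector is implemented or configured; `ToyCollector`, `redact`, the email pattern and the flaky backend are all hypothetical names invented for this example.

```python
import re
from typing import Callable

# Deliberately simple pattern, an assumption for the example only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(record: dict) -> dict:
    """Return a copy of the record with email addresses masked in the message."""
    cleaned = dict(record)
    cleaned["message"] = EMAIL_PATTERN.sub("[REDACTED]", record["message"])
    return cleaned


class ToyCollector:
    """Buffers records, filters them, and sends them to a backend in batches."""

    def __init__(self, send: Callable[[list], None], batch_size: int = 3, max_attempts: int = 3):
        self._send = send
        self._batch_size = batch_size
        self._max_attempts = max_attempts
        self._buffer: list = []

    def receive(self, record: dict) -> None:
        # Filter sensitive data before it is ever buffered or sent.
        self._buffer.append(redact(record))
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        # Retry on connection errors; a real collector would also back off between attempts.
        for attempt in range(1, self._max_attempts + 1):
            try:
                self._send(batch)
                return
            except ConnectionError:
                if attempt == self._max_attempts:
                    raise


# A stand-in backend that fails on the first call, then accepts batches.
delivered = []
calls = {"count": 0}


def flaky_backend(batch: list) -> None:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("backend unavailable")
    delivered.append(batch)


collector = ToyCollector(flaky_backend, batch_size=2)
collector.receive({"message": "login by alice@example.com"})
collector.receive({"message": "cache miss"})  # fills the batch and triggers a send
collector.receive({"message": "shutdown"})
collector.flush()  # send whatever is left in the buffer
print(delivered)
```

The application only calls `receive`; the failed first delivery is retried without the application noticing, which is the main argument for moving this logic out of the application.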

Backend §

OpenTelemetry requires a backend to store, process, analyze and visualize the telemetry data. Example backends are:

  • Prometheus: “Prometheus […] is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.”
  • Loki: “Loki: like Prometheus, but for logs. Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost effective and easy to operate. It does not index the contents of the logs, but rather a set of labels for each log stream.”
  • Elasticsearch: “Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data for lightning fast search, fine‑tuned relevancy, and powerful analytics that scale with ease.” Free to self-host, with a commercial cloud offering.
  • ClickHouse: “An open-source, column-oriented database management system that allows generating analytical data reports in real-time”. Free to self-host, with a commercial cloud offering.

Visualization §

While some backends include visualization capabilities, there is also dedicated visualization software that builds upon the backends. There is some overlap between the two, but the dedicated tools offer additional benefits.

  • Grafana: “Grafana allows you to query, visualize, alert on and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data-driven culture.” Pairs well with Prometheus as the backend.
  • Kibana: “Kibana is your window into the Elastic Stack. Specifically, it’s a browser-based analytics and search dashboard for Elasticsearch.” Only compatible with the Elasticsearch backend.

Python support §

Support for sending logs from Python is currently “experimental” in OpenTelemetry. However, support for traces and metrics is “stable”.

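To type-check code that uses the OpenTelemetry Python packages with mypy, you need to turn on namespace packages, either with the --namespace-packages flag or by setting namespace_packages = True in the mypy configuration.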

Bonus §

The original design doc for Loki is a good read. Published in 2018, it justifies the need for a simpler log aggregation system:

As mentioned above, an increasingly common pattern is the use of time series monitoring in combination with log aggregation - therefore the flexibility and sophistication offered by the query facilities often go unused; the majority of queries just focusing on a time range and some simple parameters (host, service etc). Using these systems for log aggregation is akin to using a sledgehammer to crack a nut.

It succinctly explains how logging is typically done in microservices land these days:

An interesting aside is to consider how logging has changed in the modern cloud native / microservices / containerised workloads. The standard pattern is now for applications to simply writes logs to STDOUT or STDERR. Platforms such as Kubernetes and Docker build on this to offer limited log aggregation features; logs are stored locally on nodes and can be fetched and aggregated on demand, using label selectors.

But with these simple systems, logs are often lost when a pod or node disappears. This is often one of the first triggers for a buyer to realise they need log aggregation - a pod or node mysteriously dies and no logs are available to diagnose why.
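The STDOUT pattern described in the quote can be sketched in Python with the standard library alone. This is a minimal example under assumptions of my own: the `JsonFormatter` class, the `build_logger` helper, the JSON field names and the logger name "orders" are illustrative choices, not something prescribed by Loki or OpenTelemetry.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single line of JSON."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes structured lines to the given stream (STDOUT by default)."""
    stream = sys.stdout if stream is None else stream
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())

    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate output if called more than once
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the root logger from printing the line again
    return logger


logger = build_logger("orders")
logger.info("order %s accepted", 42)
```

The application knows nothing about where the logs end up. Shipping the lines somewhere durable is left to the platform or a log aggregation system, which is exactly the gap the design doc describes.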