Logging standardization - Contextual logging - Structured logging
Logging is often a crucial instrument for debugging and we are using different ways to do so:
- Python stdlib `logging` for human readable messages without contextual information
- Implementation specific log calls which store logs in a `deque` with some context information (a schematic sketch of this pattern follows the list), e.g.
  - `Scheduler.log_event` / `Scheduler.events`: logs external stimuli from workers and clients as an event in a dictionary by source
  - `Scheduler.transition_log`: exclusively used to log transitions in a semi-structured format `(key, start, finish, recommendations, timestamp)`
  - `Worker.log`: unstructured; part events, part transitions, sometimes with timestamps
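For illustration only, here is a minimal sketch of that deque-based pattern; the class and option names below are simplified stand-ins, not the actual `distributed` implementation:

```python
from collections import deque
from time import time


class SchedulerLike:
    """Hypothetical sketch of the bounded, in-memory deque logging pattern."""

    def __init__(self, transition_log_length=100_000):
        # Bounded so the log does not grow without limit; this is the role
        # played by options such as ``transition-log-length``.
        self.transition_log = deque(maxlen=transition_log_length)

    def log_transition(self, key, start, finish, recommendations):
        # Semi-structured: a plain tuple whose meaning depends on position,
        # so every consumer has to know the layout for this event type.
        self.transition_log.append((key, start, finish, recommendations, time()))
```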
The problems I see with this approach are manifold:
- The internal `deque` logging has frequently been the cause of memory related trouble since these logs accumulate memory over time and users are often not aware of this. We artificially need to limit the amount of logs to keep with options like `transition-log-length`, `events-log-length`, `events-cleanup-delay`, etc.
- Our internal logging is not in a standardized format. Mostly there are tuples logged where the order and length differ depending on what kind of event was logged (e.g. work stealing is different from transitions, and external stimuli log entirely different information)
- Neither the stdlib logging nor the implementation specific logic currently logs enough context information (that’s very subjective of course). For instance, we know the module which created the log event but not which worker or which thread issued it, let alone in what context. Context could be as simple as logging the worker name, ip, thread ID, etc. but also application specific things like computation ID (https://github.com/dask/distributed/issues/4613) of a transition (see also https://github.com/dask/distributed/issues/4037 https://github.com/dask/distributed/issues/4718)
- The split into internal and stdlib logging means that to get all logs we usually need to consolidate multiple sources. For instance, we’ll need to collect stdout/err (or however your stdlib logger is configured), scrape all workers and the scheduler. All in different formats.
- Machine readability is often not great. For the simple filtering of "give me all events belonging to a key" we have specialized functions like `story`, but we need to write a specialized function for every possible query https://github.com/dask/distributed/blob/b577ece3f4bf5626d5ab6040065d8ff4d3880feb/distributed/worker.py#L1946-L1958
- Our internal logging is ephemeral by design and this is not optional or configurable
- The internal logging cannot be filtered by log level.
Most, if not all, of the above described issues can be addressed by custom solutions. For instance:
- Our deque loggers could be implemented as stdlib logging handlers to have one source which is highly configurable (https://docs.python.org/3/library/logging.handlers.html#logging.handlers.QueueHandler)
- Structured logging can be implemented for better machine readability https://docs.python.org/3/howto/logging-cookbook.html#implementing-structured-logging
- Logging via adapters can add more context information https://docs.python.org/3/howto/logging-cookbook.html#adding-contextual-information-to-your-logging-output (a sketch combining this with the queue handler follows below)
- etc.
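As a rough sketch of how the queue handler and adapter bullets could combine, using only stdlib APIs (the logger name and context values are made up for illustration):

```python
import logging
import logging.handlers
import queue

# One bounded, in-memory sink fed through the stdlib machinery instead of an
# ad-hoc deque; file, socket, or other handlers could be attached the same way.
log_queue = queue.Queue(maxsize=10_000)
logger = logging.getLogger("distributed.worker.example")
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.setLevel(logging.INFO)

# A LoggerAdapter attaches contextual information to every record without
# changing the call sites much; the fields here are purely illustrative.
log = logging.LoggerAdapter(
    logger, extra={"worker": "tcp://10.0.0.5:45203", "thread_id": 140512}
)

log.info("transition %s -> %s", "waiting", "processing")

# Records accumulate in ``log_queue``; they keep their level and custom
# attributes, so they can be filtered or forwarded by a QueueListener.
record = log_queue.get_nowait()
print(record.levelname, record.getMessage(), record.worker)
```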
Instead of doing all of this ourselves, we could also resort to libraries which do a great job of encapsulating this in easy-to-use APIs. One lib I am decently familiar with, and which is quite popular, is structlog, and I was wondering if this is something we are interested in.
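For a sense of what that could look like, a minimal structlog sketch (the processor choice and field names are assumptions for illustration, not a proposed configuration):

```python
import structlog

# Render every log entry as one JSON object so it stays machine readable.
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

# Bind context once; every subsequent message from this logger repeats it.
log = structlog.get_logger().bind(
    worker="tcp://10.0.0.5:45203",
    computation="hypothetical-computation-id",
)

log.info("transition", key="x-1", start="waiting", finish="processing")
# prints roughly:
# {"worker": "tcp://10.0.0.5:45203", "computation": "...", "key": "x-1",
#  "start": "waiting", "finish": "processing", "event": "transition",
#  "timestamp": "..."}
```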
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
So structlog is structured logging, so a lot better than just strings of text. The problem is that's all it is: messages at particular points in time (`loguru` is the same). A request ID included in all messages will help you trace causality somewhat, until you hit recursion, and now everything is a mess.

Eliot is fundamentally different: it gives you causality, and a notion of actions that start and end. The output is a tree of actions (or really a forest of actions). […] ("`f(12)` is slow but `f(0)` is fast").

See https://pythonspeed.com/articles/logging-for-scientific-computing/ — I gave a Dask variant of this talk at the summit earlier this year, not sure if the video is available.
Eliot is one way to do this. It has Dask Distributed support built-in, for users of Distributed: https://eliot.readthedocs.io/en/stable/scientific-computing.html
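For context, a minimal Eliot sketch of the action-tree idea described above (the file name and action types are arbitrary):

```python
from eliot import start_action, to_file

# Every action logs a start and an end message; nested actions form a tree.
to_file(open("dask_example_log.json", "a"))


def f(x):
    with start_action(action_type="f", x=x):
        return x * 2


with start_action(action_type="compute"):
    f(12)
    f(0)

# Tools such as eliot-tree can render the resulting JSON lines as a tree of
# actions with per-action durations, which is how something like
# "f(12) is slow but f(0) is fast" becomes visible.
```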
Another alternative, which is attractive in that there is a bunch of existing tooling for it because many SaaS platforms and tracing software systems support it, is OpenTelemetry.
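A correspondingly minimal OpenTelemetry tracing sketch, with a console exporter standing in for whatever backend would actually be used (span names and attributes are made up):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter just for illustration; a real deployment would export to a
# collector or one of the SaaS backends mentioned above.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("distributed.example")

with tracer.start_as_current_span("compute") as span:
    span.set_attribute("dask.key", "x-1")
    with tracer.start_as_current_span("transfer"):
        pass  # nested spans carry the parent/child causality automatically
```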
Bigger picture perspective: if Dask Distributed has a good tracing/logging setup, and users are encouraged to use the same framework, users get to see logs that connect not just their own logic but also how the distributed system is scheduling everything, which is probably useful for performance optimization.
While this is prone to change based on this discussion, would it be worthwhile giving more information on log config options in the configuration reference? Not sure how heavily trafficked that page is but I recall going there looking for log config options (such as the ability to output to file) and assuming they didn’t exist because I didn’t see any options listed there.