Structured Logging (for JSON)
Hey all, this is my very first issue - so please bear with me 😃 I’ve checked open as well as closed issues around the keywords I’d suspect to match, and blame’d the files in question to get an idea of which previous issues could be relevant.
Describe the feature
The logging that is currently in place seems to be tailored mainly to human readers, which is absolutely okay - but it makes the generated logs much harder to process programmatically.
In the end I’d like to present proper error reasons around failed models with all the relevant information in one record.
Running the tests with `dbt --log-format=json test`, for example, generates the following in `logs/dbt.log`:
```
2020-11-24 09:09:04.745841 (MainThread): Warning in test not_null_finance__some_statuses_per_day_pan (models/marts/finance/finance__some_statuses_per_day.yml)
2020-11-24 09:09:04.750021 (MainThread): Got 55169 results, expected 0.
2020-11-24 09:09:04.755953 (MainThread):
2020-11-24 09:09:04.758927 (MainThread): compiled SQL at target/compiled/project/models/marts/finance/finance__some_statuses_per_day.yml/schema_test/not_null_finance__some_statuses_per_day_pan.sql
```
while on STDOUT we get (pretty printed for easier reading):
```json
{
  "timestamp": "2020-11-24T09:09:04.745841Z",
  "message": "\u001b[33mWarning in test not_null_finance__some_statuses_per_day_pan (models/marts/finance/finance__some_statuses_per_day.yml)\u001b[0m",
  "channel": "dbt",
  "level": 13,
  "levelname": "WARNING",
  "thread_name": "MainThread",
  "process": 5382,
  "extra": {
    "run_started_at": "2020-11-23T14:42:09.667614+00:00",
    "invocation_id": "0ccc471d-b39e-4387-ae44-362094d1ad0a",
    "is_status_message": true,
    "run_state": "internal"
  }
}
{
  "timestamp": "2020-11-24T09:09:04.750021Z",
  "message": " Got 55169 results, expected 0.",
  "channel": "dbt",
  "level": 14,
  "levelname": "ERROR",
  "thread_name": "MainThread",
  "process": 5382,
  "extra": {
    "run_started_at": "2020-11-23T14:42:09.667614+00:00",
    "invocation_id": "0ccc471d-b39e-4387-ae44-362094d1ad0a",
    "is_status_message": true,
    "run_state": "internal"
  }
}
{
  "timestamp": "2020-11-24T09:09:04.758927Z",
  "message": " compiled SQL at target/compiled/project/models/marts/finance/finance__some_statuses_per_day.yml/schema_test/not_null_finance__some_statuses_per_day_pan.sql",
  "channel": "dbt",
  "level": 11,
  "levelname": "INFO",
  "thread_name": "MainThread",
  "process": 5382,
  "extra": {
    "run_started_at": "2020-11-23T14:42:09.667614+00:00",
    "invocation_id": "0ccc471d-b39e-4387-ae44-362094d1ad0a",
    "is_status_message": true,
    "run_state": "internal"
  }
}
```
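To illustrate the "process the logs programmatically" point: since STDOUT emits one JSON object per line (the pretty-printing above is just for readability), a consumer can at least filter by severity. This is a hypothetical sketch, not anything dbt ships - the function names are mine:

```python
import json

def iter_json_records(lines):
    """Parse one JSON record per line, skipping lines that are not valid JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

def problem_records(records, levels=("WARNING", "ERROR")):
    """Keep only the records at the given severity levels."""
    return [r for r in records if r.get("levelname") in levels]
```

Run against the three records above, this would return the WARNING and ERROR lines - but, as described next, nothing in those records ties them to the same test, which is exactly the gap this issue is about.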
It originates from those lines in the codebase, and I cannot even relate the emitted records to each other, because they share no common property.
I was toying with the idea of refactoring the way this logging works, which would allow the human-readable log output to stay unchanged while the structured JSON output could carry much more information. For a first shot, probably just all available information in Logbook’s `extra` dict - nothing too fancy.
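dbt uses Logbook internally, but the idea of attaching structured context to every record and serializing it out can be sketched with Python’s stdlib `logging` just as well. Everything here is illustrative - the `JsonFormatter` class and the `context` attribute are my own names, not dbt’s:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Serialize each record, including attached structured context, as one JSON line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "message": record.getMessage(),
            "levelname": record.levelname,
            # 'context' is a hypothetical attribute attached via extra= below;
            # real dbt/Logbook records shape their extra dict differently.
            "extra": getattr(record, "context", {}),
        }
        return json.dumps(payload)

logger = logging.getLogger("dbt-sketch")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Attaching the node's unique_id would let consumers relate all lines for one test:
logger.warning(
    "Got 55169 results, expected 0.",
    extra={"context": {"unique_id": "test.project.not_null_finance__some_statuses_per_day_pan"}},
)
```

The point is only that once every record carries a machine-readable identifier, the three unrelated lines above become trivially joinable.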
Describe alternatives you’ve considered
Nothing comes to mind initially, but I’m open to different ideas for getting the information I’m looking for.
Additional context
To my understanding this could/should be a very generic change that - depending on the implementation - wouldn’t change much for anyone using the default configuration.
Who will this benefit?
Everyone consuming logs in a machine-readable format, primarily for programmatic access. Also everyone using any kind of logging service: it would allow people to access aggregated information in one record, rather than scattered across multiple lines.
Are you interested in contributing this feature?
I’ve been in touch with dbt only as a user/admin until now, but I have already started to dig into the code base and would like to help, given your assistance and feedback.
Issue Analytics
- Created 3 years ago
- Comments: 15 (7 by maintainers)
@steffkes you raise some fair points! After rereading the above, I also see that I missed some of the specific details in the scenarios you were laying out - sorry about that.
To explain my rationale here, I’d like to draw a distinction between real-time statuses and the invocation-level summary. In the standard CLI output, this separation occurs between the `Finished` and `Completed` lines.

1. Real-time statuses

Up to the word `Finished`, the logs populate in real time, and they can provide useful information about which models are running, whether tests are passing, and so on. Here’s one of the same log lines as above, now JSON-formatted:
And here’s the same line for the same test, now configured with error-level severity:
Those are the lines that we’d hope monitoring would catch by checking the level/levelname for `WARNING` and `ERROR`. The `extra.unique_id` is what you can use to identify the specific resource that’s running. You can use that `unique_id` to look up more information about the test in the `nodes` object in `manifest.json`, which includes its config, parents, and compiled SQL. Based on your feedback above, it sounds like that’s the contextual info you want in the `extra` dict, and I am open to the feedback that we could surface more information there.

2. Invocation-level summary
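That lookup can be sketched in a few lines. The `nodes` object keyed by `unique_id` comes straight from the comment above; the specific fields pulled out (`config`, `depends_on`) are examples and may differ across dbt versions, so check your version’s manifest schema:

```python
import json

def describe_node(manifest_path, unique_id):
    """Look up a node from manifest.json by its unique_id and return a few
    contextual fields. Field names beyond 'nodes' are illustrative only."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    node = manifest["nodes"][unique_id]
    return {
        "unique_id": unique_id,
        "config": node.get("config"),
        "depends_on": node.get("depends_on"),
    }
```

A monitoring pipeline could call this for every `unique_id` it sees on a `WARNING`/`ERROR` record, enriching the alert with the test’s config and parents.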
Aggregating, filtering, and summarizing what happened in a given run, after the run is over, is where the JSON artifacts really come into play. The combination of `run_results.json` and `manifest.json` will have a richer, better-organized, and more stable set of information than what’s available in the logs. Here’s a subset of the information about that one test from `run_results.json`:

Granted, it’s still not as good as it could be; we are reorganizing that information to be more straightforward and intuitive in the next release of dbt (#2493).
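As a rough illustration of that post-run workflow, here is a sketch that groups `run_results.json` entries by status. The exact keys (`results`, `status`, `unique_id`) vary between dbt versions - treat them as assumptions and adapt to your artifact schema:

```python
def summarize(run_results):
    """Group run results by status. Assumes each entry carries a 'status'
    and a 'unique_id'; the exact keys differ across dbt versions."""
    summary = {}
    for result in run_results.get("results", []):
        summary.setdefault(result.get("status", "unknown"), []).append(
            result.get("unique_id")
        )
    return summary
```

With something like this, "which tests failed in this invocation" becomes one dictionary lookup instead of a scrape of multi-line log output.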
That being said, you’ve got a really good point: the JSON representations of the summary lines starting with `Completed` are not all that helpful, since these are definitely optimized for human readability in stdout. These are the lines produced for me (running with dbt v0.18.1):

I’d welcome some changes to better coordinate those “summary” log lines in their JSON output, something like:
Let me know what you think, and if the distinction I’ve drawn above makes sense for your needs.
Don’t want to get lost in the details … I’ve taken a stab at this https://github.com/fishtown-analytics/dbt/compare/dev/kiyoshi-kuromiya...steffkes:feature/2915-structured-logging :
It doesn’t change `logs/dbt.log` much, but it would generate one log entry like this:
That would give you way more to work with in terms of aggregating and filtering, don’t you think @jtcohen6?
It’s probably not the way it should/would be implemented in the end - more a quick way to demonstrate the idea I had in mind, without changing too much of the existing code, making it easier to follow.