Doc (and potentially, Test) Inheritance
Describe the feature
I’m a big believer in the power of documentation, and I love dbt’s doc generation and testing. However, there is still a massive amount of manual work that goes into maintaining docs and tests, especially if you have many layers where many of the column descriptions aren’t actually changing.
I’m finding I’m consistently spending 70% of my development time just wiring up documentation through doc blocks. Same for tests, if you want to test every layer. At least upstream tests can catch some data issues, but you can’t really do the same with docs without reading the source code SQL.
There are 3 cases that have to be accounted for with documentation (and tests, in fact):
- The names of the columns have not changed from upstream models
- There has been a renaming of a column compared to an upstream model
- There has been a meaningful transformation or additional column, compared to an upstream model
One can either address each of those concerns incrementally or come up with a solution that solves all 3.
Ideal Solution
In an ideal world, Cases 1 and 2 could be solved if dbt could parse the SQL in all models using something like sqlparse to get true field-level lineage.
I know your first thought is probably “What about all the different dialects?” but I would say to that, we don’t need to parse everything. We only need the SELECT [fields] and AS in the bottom-most expression. In fact, it might even be possible with regex, and is most likely the same across all dialects.
This could:
- Inherit any docs in Case 1 or Case 2.
- Optionally scaffold the description yml keys for Case 3, to be filled in by the user (can be broken out to a separate feature).
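As a rough illustration of the regex idea above, the sketch below pulls the column list out of the last SELECT ... FROM in a model and sorts each output column into Case 1, 2, or 3. The function names (split_top_level, classify_columns) and the sample SQL are made up for this example. It is not how dbt works, and it assumes the final SELECT comes last in the file, has no subquery after it, and contains no expression that uses the keyword FROM internally (for example EXTRACT(... FROM ...)).

```python
import re

# A bare or table-qualified column name, e.g. "id" or "o.created".
IDENT = re.compile(r"^(?:\w+\.)?(\w+)$")
# "<expression> AS <alias>" at the end of a select item.
ALIAS = re.compile(r"^(.*?)\s+as\s+(\w+)$", re.I | re.S)
# Any SELECT ... FROM pair; the last match is treated as the final select.
SELECT_LIST = re.compile(r"\bselect\b(.*?)\bfrom\b", re.I | re.S)


def split_top_level(select_list):
    """Split a select list on commas that are not inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in select_list:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def classify_columns(sql):
    """Map each output column to (case, upstream column name or None)."""
    matches = list(SELECT_LIST.finditer(sql))
    if not matches:
        return {}
    result = {}
    for item in split_top_level(matches[-1].group(1)):
        alias_match = ALIAS.match(item)
        if alias_match:
            expr, alias = alias_match.group(1).strip(), alias_match.group(2)
        else:
            expr, alias = item, None
        ident = IDENT.match(expr)
        if ident:
            source = ident.group(1)
            name = alias or source
            # Case 1: same name as upstream. Case 2: renamed with AS.
            result[name] = (1 if name == source else 2, source)
        else:
            # Case 3: anything that is not a plain column (functions, *, ...).
            result[alias or expr] = (3, None)
    return result


example_sql = """
with src as (select * from raw.orders)
select
    id,
    o.created as created_at,
    coalesce(amount, 0) as amount_filled
from src as o
"""
columns = classify_columns(example_sql)
```

Here columns maps id to Case 1, created_at to Case 2 (renamed from created), and amount_filled to Case 3. Only the first two would be candidates for inherited docs; Case 3 columns would still need a hand-written description.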
I know this is a huge ask and involves major work, which is why I’m gonna put forward an incremental solution as well.
Incremental Solution
An easier-to-implement, but not as valuable solution would be to make this config-driven on a field by field basis. You still have to write quite a bit of boilerplate, but at least you’re not copy-pasting (which leads to inconsistencies), or spinning up hundreds of model.md files to make doc block references (still very error-prone to get the doc block names right).
This might be something like:
# schema.yml
version: 2
models:
  - name: model_2
    description: "Unique description for model 2"
    columns:
      - name: col2
        description:
          inherit: upstream1 # Case 1 - No change in column name, so we just give the upstream model name
        tests: # Put this here for demonstration, cause it was juicy :)
          inherit: upstream1
      - name: col1
        description:
          inherit: src1.column_one # Case 2 - Col name was renamed
      - name: col3
        description: "This is brand new documentation" # Case 3 - Meaningful transformation or addition in this model
Alternative syntax might be:
# schema.yml
version: 2
models:
  - name: model_2
    description: "Unique description for model 2"
    columns:
      - name: col2
        description: '{{ doc_inherit(ref("upstream1")) }}' # Case 1 - No change in column name, so we exclude the second argument
      - name: col1
        description: '{{ doc_inherit(source("src1"), "column_one") }}' # Case 2 - Col name was renamed
      - name: col3
        description: "This is brand new documentation" # Case 3 - Meaningful transformation or addition in this model
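To make the intended behaviour of the proposed inherit key concrete, here is a minimal sketch of how it could be resolved. Everything in it is hypothetical: inherit is the syntax proposed in this issue, not an existing dbt feature, and resolve_description and the upstream_docs lookup table are names invented for the example. The column entries are plain dicts shaped like the parsed schema.yml above.

```python
def resolve_description(column, upstream_docs):
    """Return the final description for one column entry.

    column: dict shaped like a parsed schema.yml column.
    upstream_docs: {model_or_source_name: {column_name: description}}.
    """
    description = column.get("description", "")
    if not isinstance(description, dict):
        # Case 3: a normal, hand-written description (or none at all).
        return description
    # "upstream1" (Case 1) or "src1.column_one" (Case 2).
    model, _, upstream_column = description["inherit"].partition(".")
    # Case 1: no column given, so reuse this column's own name.
    upstream_column = upstream_column or column["name"]
    try:
        return upstream_docs[model][upstream_column]
    except KeyError:
        raise ValueError(
            f"No upstream description for {model}.{upstream_column}"
        ) from None


# Assumed upstream descriptions, for illustration only.
upstream_docs = {
    "upstream1": {"col2": "Description written once on upstream1"},
    "src1": {"column_one": "Description written once on src1"},
}
model_2_columns = [
    {"name": "col2", "description": {"inherit": "upstream1"}},
    {"name": "col1", "description": {"inherit": "src1.column_one"}},
    {"name": "col3", "description": "This is brand new documentation"},
]
resolved = {
    col["name"]: resolve_description(col, upstream_docs)
    for col in model_2_columns
}
```

Raising an error on a missing upstream description is one possible design choice; it mirrors how a wrong doc block name fails today, instead of silently leaving the column undocumented.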
Describe alternatives you’ve considered
Currently in order to solve this, I’m defining a new .md file for every model and using a doc reference in my schema.yml files. In order to keep things DRY and keep me sane, instead of referring to the immediate parent in every model, I refer to the ancestors (grandparents) of the model if possible.
For example, if source1 -> model1 -> model2 all share a field name, model2 refers to the doc of source1 instead of model1. This way, I’m only changing descriptions in 1 place if the need arises.
Who will this benefit?
I think implementing the Ideal Solution would have a major impact, not only speeding up documentation and keeping it up-to-date, but also allowing you to apply the same strategy to tests, which is huge.
In the Incremental Solution, I’d say this would save me about 30-40% of my time for developing each feature. This is because I don’t have to make additional doc.md files and don’t have to worry about getting the doc() references correct, which can be a nightmare in a complex project.
I think if documentation (and maybe tests) are your thing, this solution would pave the way for saving tens of thousands of dollars over the life of a project in time saved.
Are you interested in contributing this feature?
Maybe if I had a better idea of how dbt works under the hood based on a guide or something.
Issue Analytics
- State:
- Created 3 years ago
- Reactions:51
- Comments:12 (5 by maintainers)
Top GitHub Comments
Sharing our use-case here in case it might spark some more ideas from the community around this feature.
At my organization, I just developed a Python package to try to solve some of the issues described by the original poster (avoid duplicating documentation on a multi-layered data warehouse architecture). My approach was to develop an external Python package that parses the manifest and catalog and propagates the column documentation in case a column is not documented in a model but an upstream model has a column with the same name that is documented. The new inherited column descriptions are then written in a new manifest that we use in our “dbt docs release pipeline”.
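The propagation idea described above can be sketched roughly as follows. This is not the commenter's package, whose code is not shown here. The function name propagate_descriptions is hypothetical, and the sketch assumes only that the manifest has nodes and sources keyed by unique id, that each node lists its parents under depends_on.nodes, and that each declared column carries a description. It ignores the catalog, so columns that are not declared in any yml file are not covered.

```python
import copy


def propagate_descriptions(manifest):
    """Return a copy of the manifest where undocumented columns inherit
    the description of the nearest upstream column with the same name."""
    result = copy.deepcopy(manifest)
    nodes = {**result.get("sources", {}), **result.get("nodes", {})}

    def lookup(unique_id, column, seen=()):
        node = nodes.get(unique_id)
        if node is None or unique_id in seen:
            return ""
        own = node.get("columns", {}).get(column, {}).get("description", "")
        if own:
            return own
        # Not documented here: keep walking up through the parents.
        for parent in node.get("depends_on", {}).get("nodes", []):
            inherited = lookup(parent, column, seen + (unique_id,))
            if inherited:
                return inherited
        return ""

    for unique_id, node in result.get("nodes", {}).items():
        for name, column in node.get("columns", {}).items():
            if not column.get("description"):
                column["description"] = lookup(unique_id, name)
    return result


# Tiny made-up manifest: source -> model1 -> model2, all sharing "id".
manifest = {
    "sources": {
        "source.proj.src1.orders": {
            "columns": {"id": {"description": "Primary key"}},
        },
    },
    "nodes": {
        "model.proj.model1": {
            "depends_on": {"nodes": ["source.proj.src1.orders"]},
            "columns": {"id": {"description": ""}},
        },
        "model.proj.model2": {
            "depends_on": {"nodes": ["model.proj.model1"]},
            "columns": {"id": {"description": ""}, "total": {"description": ""}},
        },
    },
}
propagated = propagate_descriptions(manifest)
```

In this made-up project, id on both models picks up the source description, while total stays empty because nothing upstream documents it. The input manifest is left untouched, which matches the idea of writing the result to a new manifest.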
Some of the features we implemented and the rationale behind:
- [...] id, created_at, etc.).
- [...] schema.yml files (and thus, are not in the dbt manifest) are covered.
- [...] is_active can have a meta tag stating “also inherit from upstream columns named active”.
- [...] manifest.json) is a list of which columns are currently not documented but would be propagated to most places if they were. This helps our analysts to add documentation to places where it might bring more value. Also, it shows how many columns in the entire project are documented, how many are not documented, and how many were propagated.
Here is a simple example, taken from a sample dbt project used as part of the test suite of the package:
We will actually release this into production next week. If there’s interest from the others, I can try to come back here after a few weeks with our learnings. Also if there’s enough interest (maybe by reacting to this comment?), I can try to convince management to let me spend time writing a blog post and open-sourcing this project.
Hi guys. I created a quick Python script to generate docs blocks from yml files. This script is independent of the dbt manifest file, since dbt compile won’t run if a docs block is not found. https://gist.github.com/ducchetrongminh/c494d867feec925a5b59515714778279
Copy this file to your repo. Your workflow will change a little bit, but it saves time compared to creating md files or copying your docs:
python -m path.to.dbt_docsblock_autogenerator && dbt docs generate (or dbt compile)
Then you just need to reference the docs block.
Or even add an extra doc. Cool?
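For readers who have not opened the gist, the general idea of generating docs blocks can be sketched like this. It is not the gist's code: render_docs_blocks and the block names are invented for illustration, and the input is assumed to be a plain mapping from block name to description text that was already read from the yml files.

```python
def render_docs_blocks(descriptions):
    """Render {block_name: text} as the contents of a dbt docs .md file."""
    blocks = []
    for name, text in sorted(descriptions.items()):
        # Doubled braces in the f-string produce literal "{%" and "%}".
        blocks.append(f"{{% docs {name} %}}\n{text}\n{{% enddocs %}}")
    return "\n\n".join(blocks) + "\n"


# Hypothetical descriptions collected from yml files.
docs_md = render_docs_blocks({
    "order_id": "Primary key of the order.",
    "created_at": "Timestamp when the order was created.",
})
```

The resulting text would be written to an .md file inside the project, after which a column description can point at a block with '{{ doc("order_id") }}'.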
Hope this can save your time. Happy documenting 😄