New node type: reports
**Edit:** I previously called these `exposures`. I’ve changed it to `report` to keep it more tangible for this first version.
dbt Core needs to have a semantic understanding of a `report`. A “report” is a generalization of a “dashboard”, though it could also be an ML model, Jupyter notebook, R Shiny app, in-app data viz, etc.

The `report` node serves two purposes:
- Define a set of models as dependencies
- Define metadata to populate an embed tile and a special dbt-docs “landing page”
Why?
- We want to be able to push information about data quality and lineage from dbt into external tools. Exposures should be the Core side of that contract.
- The same reason we added `sources`: there’s a big piece missing from the DAG today. Once downstream uses of dbt models are registered as `reports`, dbt project maintainers can start asking questions like:
  - Which models are our most critical dependencies?
  - Which final models are only used by exposures with very few/irregular views?
  - Which final models aren’t used at all?
Tentative spec
**Edit:** updated based on comments below
`models/anything.yml`

```yml
version: 2

exposures:
  - name: orders_dash
    type: dashboard  # one of: {dashboard, notebook, analysis, ml, application}
    url: https://fishtown.looker.com/dashboards/1
    maturity: high   # i.e. importance/usage/QA/SLA/etc. one of: {low, medium, high}
    description: >
      This is a dashboard of monthly orders, with customer
      attributes as context.

    depends_on:
      - ref('fct_orders')
      - ref('dim_customer')
      - source('marketplace', 'currency_conversions')

    owner:
      email: jeremy@fishtownanalytics.com  # required
      name: "Jeremy from FP&A"             # optional
```
Core functionality
By way of `depends_on`, I’d expect `dbt run -m +report:orders_dash` to run all upstream models, and `dbt test -m +report:orders_dash` to test all upstream nodes. `dbt run -m +report:*` would run all models upstream of all reports. A report cannot itself be run or have tests defined.

**Edit:** I changed this syntax to feel more like the `source:` model selector method. Rationale: `dbt run -m orders_dash` has no effect; it’s worth calling out that this is a special thing.
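As a minimal sketch of what that selector could resolve to, assuming the project DAG is available as a networkx graph (dbt Core builds its DAG with networkx) and a hypothetical `unique_id` format for report nodes:

```python
import networkx as nx

def select_plus_report(graph: nx.DiGraph, report_unique_id: str) -> set:
    """Resolve `-m +report:<name>`: the report node plus everything upstream of it."""
    return {report_unique_id} | nx.ancestors(graph, report_unique_id)

# e.g. select_plus_report(manifest_graph, "report.my_project.orders_dash")
```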
Open questions
- What should be the docs “landing page” for a `report`? I’ll write up a related dbt-docs issue
- Crucially, we’ll need a mechanism that can parse a report’s `depends_on` from `manifest.json`, compile the list of all upstream nodes, and then search in `run_results.json` and `sources.json` for all associated tests and freshness checks. Where should that mechanism live, exactly? (There’s a rough sketch of one possibility at the end of this section.)
- `owner`: There is added benefit (and constraint) to tying this to a dbt Cloud user ID. Should we try to make a mapping via email only, instead?
- `type` + `maturity`: Should we constrain the set of options here? Offer them as free text fields? I like the eventual idea of summarizing things like:
      fct_orders is directly exposed in:
        3 dashboards, of varying maturity (high: 2, medium: 1)
        1 low-maturity ML pipeline
        2 medium-maturity apps
      fct_orders indirectly powers:
        2 medium-maturity dashboards
  Maybe that’s in dbt-docs, maybe that’s an `ls` command, maybe it’s a pipe dream. This is the piece that feels least critical for the first version.
@drewbanin to help suss out some answers
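To make the open question about that mechanism more concrete, here is a rough sketch of what it could look like if it lived in a standalone script reading dbt’s artifacts. It assumes a hypothetical `reports` key in `manifest.json` whose entries carry `depends_on` node IDs like other nodes do, and it uses field names from recent artifact schemas (`parent_map`, `unique_id` on results), which may differ across dbt versions; none of this is an actual dbt API.

```python
import json
import networkx as nx

def report_status(report_unique_id: str, target_dir: str = "target") -> dict:
    """Collect upstream nodes and their test results for one report.

    Assumes manifest.json gains a hypothetical "reports" section whose entries
    list depends_on node IDs, and that run_results.json keeps its current shape.
    """
    with open(f"{target_dir}/manifest.json") as f:
        manifest = json.load(f)
    with open(f"{target_dir}/run_results.json") as f:
        run_results = json.load(f)

    # Rebuild the DAG from the manifest's parent_map (child -> [parents]).
    graph = nx.DiGraph()
    for child, parents in manifest["parent_map"].items():
        graph.add_node(child)
        for parent in parents:
            graph.add_edge(parent, child)

    # Start from the report's declared dependencies, then walk all the way upstream.
    report = manifest["reports"][report_unique_id]  # hypothetical key
    upstream = set(report["depends_on"]["nodes"])
    for node in list(upstream):
        if node in graph:
            upstream |= nx.ancestors(graph, node)

    # Keep every test result whose test touches an upstream node.
    test_results = []
    for result in run_results["results"]:
        node = manifest["nodes"].get(result["unique_id"], {})
        if node.get("resource_type") == "test" and upstream & set(
            node.get("depends_on", {}).get("nodes", [])
        ):
            test_results.append(result)

    return {"upstream_nodes": sorted(upstream), "test_results": test_results}
```

Running the same walk over every report and grouping by `type` and `maturity` would give the kind of “fct_orders is directly exposed in …” summary sketched above, whether that ends up in dbt-docs or an `ls`-style command.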
Future work (not v1)
- Can reports declare non-model nodes in their `depends_on`? (Could reports depend on other reports?)
- Can reports modify expectations of upstream tests? I could imagine overriding a test severity, or defining a set of tests to exclude from consideration.
- Can reports be tagged?
- Can they accept arbitrary additional fields (i.e. `meta`)?
- Can we closely tie `owner` to a dbt Cloud user by email? Purpose: configurable notification channels, asset rendering in dbt-docs
@bashyroger I agree with a lot of what you said above. If you’ll humor me, I think this is a yes/and rather than an either/or.
dbt could, and someday should, pull information from database query logs to get a sense of how models are being used in production. There would be tremendous benefit in identifying unused final models that are good candidates for deprecation, or shaky models that serve as cornerstones to crucial reports yet lack the commensurate testing. This tooling would be immensely valuable to dbt developers, and I completely agree that to such an end there should be a full decoupling between dbt resources and our ability to see how they’re queried in the wild. dbt maintainers should see usage information for dashboards they know about and (especially) dashboards they don’t.
The matrix of tools here is a challenge, but not an insurmountable one. It sounds like you’ve made some significant progress on this for BigQuery + Looker. Any chance there are code snippets or approaches you’d be interested in contributing back? 😄
At the same time, there’s a lot of useful information already stored in dbt artifacts today—source freshness, test success—that we can and should put in front of data consumers and other downstream beneficiaries of dbt models. We think that means embedding it in the BI/application/usage layer. This is a subtle but significant distinction IMO:
To my mind, this roughly maps to the distinction, mentioned in Tristan’s and Jia’s “Are dashboards dead?” conversation, between ad hoc exploration on the one hand + dashboards with service-level guarantees on the other. We may see different, more precise tooling emerge to support each use case, which I really do believe to be quite different. Ultimately, I think dbt developers will want to know about both categories—in-the-wild querying and sanctioned usage—and we’ll want to find compelling ways to integrate that information in the long run.
If you want to support refables and sources (which I think is a good idea!), what about just using `ref` and `source`? Users should already have at least some level of comfort with the behavior of `ref`/`source`. Allowing sources will probably be nice for testing, too (zero database operations required).