Subselectors for state:modified
This is a follow-on from the initial proposals in #2465, #2641. Most of the required work is around exposing + sugaring the foundational work in #2695.
The way state comparison works is by identifying discrepancies between two manifests. When comparing between a past prod manifest and the current development manifest, discrepancies can be the result of two things:
- Changes made to the project in development
- Env-aware logic that causes different behavior based on the `target`, env vars, etc.
We’re going to do our best to capture only changes that are the result of development. If someone’s project has tons of intricate env-aware logic, they’ll run more models than they want (i.e. more false positives). So we’re giving them the option to turn off some knobs, in the form of more-specific subselectors.
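As a rough illustration of why env-aware logic produces false positives, the sketch below diffs two simplified manifests. The dictionary layout and the `diff_manifests` helper are hypothetical; they are not dbt's actual manifest schema or API.

```python
# Hypothetical, simplified manifests: node name -> attributes.
# Real dbt manifests are far richer; this only illustrates the idea.
prod_manifest = {
    "model.orders": {"raw_sql": "select * from raw.orders", "materialized": "table"},
    "model.customers": {"raw_sql": "select * from raw.customers", "materialized": "table"},
}
dev_manifest = {
    # Edited in development: a genuine change.
    "model.orders": {"raw_sql": "select id, amount from raw.orders", "materialized": "table"},
    # Untouched file, but env-aware logic resolves to a view in dev.
    "model.customers": {"raw_sql": "select * from raw.customers", "materialized": "view"},
}


def diff_manifests(old, new):
    """Return {node name: sorted list of attribute names that differ}."""
    changed = {}
    for name, attrs in new.items():
        previous = old.get(name)
        if previous is None:
            changed[name] = ["<new node>"]
            continue
        keys = sorted(
            key for key in attrs.keys() | previous.keys()
            if attrs.get(key) != previous.get(key)
        )
        if keys:
            changed[name] = keys
    return changed


discrepancies = diff_manifests(prod_manifest, dev_manifest)
```

Both nodes are reported as modified, although only `model.orders` was edited; the `model.customers` discrepancy comes purely from the environment.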
Subselectors
There is potential for overlap: a single change can trigger multiple modification categories.
`state:modified.contents`:
- Models: raw file contents changed
- Snapshots: raw file contents changed
- Data tests: raw file contents changed
- Analyses: raw file contents changed
- Seeds: raw file contents changed
- However, if the file is >1 MB, we cannot compare raw file contents, so we raise a warning and just compare based on file path instead.
This alone would get a lot of people what they want! It’s basically “just hash the files,” excluding YAML configuration.
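The "just hash the files" idea, including the fallback for large files, might look roughly like the sketch below. The function names, the choice of SHA-256, and the exact byte count used for "1 MB" are assumptions made for illustration, not a description of dbt's implementation.

```python
import hashlib

# Assumed interpretation of the 1 MB limit mentioned above.
MAX_COMPARABLE_BYTES = 1024 * 1024


def file_fingerprint(path, contents):
    """Return a (kind, value) pair used to compare a file across manifests.

    `contents` is the raw file as bytes. Files above the size limit are
    fingerprinted by their path instead of their contents.
    """
    if len(contents) > MAX_COMPARABLE_BYTES:
        return ("path", path)
    return ("sha256", hashlib.sha256(contents).hexdigest())


def contents_modified(old_fingerprint, new_fingerprint):
    """True if the two fingerprints indicate a change."""
    return old_fingerprint != new_fingerprint
```

Under this sketch, an edit inside an oversized seed goes undetected as long as its path stays the same, which is the trade-off the warning is meant to surface.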
`state:modified.configs`:
- Models: changes to materialized, quoting, bind, transient, sort/dist, partition_by, incremental_strategy…
  - This category captures changes in `dbt_project.yml` or `{{config()}}` blocks. If the changes are made in a `{{config()}}` block, they will also be picked up as content changes.
  - If someone has env-aware logic for materialized, where a model is a view in dev and a table in prod, they will not want to include this.
- Snapshots: unique_key, strategy, …
- Seeds: quoting, column_types
- Schema tests: severity changes
`state:modified.descriptions`:
- If `persist_docs` is turned on for a node, description changes count as modifications. (If it is on for columns only, just column descriptions count; if for relations only, just top-level descriptions; if both, then both.)
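A minimal sketch of that rule follows, assuming each node is a plain dict with `persist_docs`, `description`, and `columns` keys. That dict shape and both helper names are hypothetical; only the `relation` and `columns` flags mirror the `persist_docs` config.

```python
def persisted_descriptions(node, persist_docs):
    """Collect only the descriptions that `persist_docs` writes to the database."""
    fields = {}
    if persist_docs.get("relation"):
        fields["relation"] = node.get("description", "")
    if persist_docs.get("columns"):
        fields["columns"] = dict(node.get("columns", {}))
    return fields


def descriptions_modified(old_node, new_node):
    """Compare descriptions using the current node's persist_docs settings."""
    persist_docs = new_node.get("persist_docs", {})
    return (
        persisted_descriptions(old_node, persist_docs)
        != persisted_descriptions(new_node, persist_docs)
    )
```

With only `columns` enabled, rewording the model's top-level description would not count as a modification, while rewording a column description would.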
`state:modified.database_representations`:
- Models: changes to the configured `database`, `schema`, `identifier`. This value represents the manual input only, and it’s different from the resolved database representation, which depends on the `target` and `generate_x_name` macros.
  - If someone manually sets `schema = target.schema`, or `schema = target.schema ~ '_suffix'` instead of using the `generate_schema_name` macro, that will register as a change between environments and they’ll want to turn this off.
  - Depending on the `generate_x_name` logic and the current environment, a change to the configured value may not actually change the database representation. We’ll still register it as a modification.
- Seeds: treated the same as models.
- Sources: `database`, `schema`, or `identifier` has changed. If someone has env-aware definitions, they’ll want to turn this off.
- Snapshots: treated the same as sources.
Default behavior
I think `state:modified` should include all changes from all the categories above. The question mark is whether `database_representations` should be included in the default, since this is the area where people do the most custom things, and it’s the knob that will likely be switched off most often. For the sake of clarity, I think it’s best to have the `state:modified` selector be a superset of all `modified` subselectors.
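One way to picture the superset relationship is as a set union over per-category results. The category names below follow the proposal; the node names and the `select_modified` helper are made up for illustration.

```python
# Hypothetical per-category results: subselector name -> set of modified nodes.
modified_by_category = {
    "contents": {"model.orders"},
    "configs": {"model.orders", "model.customers"},
    "descriptions": set(),
    "database_representations": {"source.raw.events"},
}


def select_modified(results, subselectors=None):
    """Union the chosen categories; with no subselectors given, use all of them."""
    names = list(results) if subselectors is None else list(subselectors)
    selected = set()
    for name in names:
        selected |= results[name]
    return selected


# Plain state:modified: every category.
everything = select_modified(modified_by_category)
# Only the categories a user with env-aware schemas might keep.
narrowed = select_modified(modified_by_category, ["contents", "configs"])
```

Any combination of subselectors then yields a subset of what the bare selector returns, which is the property argued for above.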
Future art
`state:modified.macros`:
- A macro’s raw contents have changed
- By extension, `state:modified.macros+` would include all downstream models, tests, etc. that call (directly or indirectly) a macro that has changed
  - This also includes implicit macro dependencies such as `generate_schema_name`
`state:modified.vars`:
- A var value has changed
- By extension, `state:modified.vars+` would include all downstream models, tests, etc. that call (directly or indirectly) a var that has changed
We will update `state:modified` to include both of these as well.
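The "directly or indirectly" part of the `+` variants amounts to a graph traversal. Here is a minimal sketch, assuming a hypothetical `depends_on` mapping from each resource to the macros or models it uses directly; the resource names and the `downstream_of` helper are illustrative only.

```python
from collections import deque

# Hypothetical dependency edges: resource -> resources it calls or refs directly.
depends_on = {
    "macro.cents_to_dollars": [],
    "macro.format_money": ["macro.cents_to_dollars"],
    "model.orders": ["macro.format_money"],
    "model.customers": [],
    "test.orders_not_null": ["model.orders"],
}


def downstream_of(changed, depends_on):
    """Return `changed` plus everything that depends on it, directly or indirectly."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)

    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


selected = downstream_of({"macro.cents_to_dollars"}, depends_on)
```

Here `model.orders` never calls the changed macro itself, but it is still selected because it calls `macro.format_money`, which does; `model.customers` is left out.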
@jtcohen6 just here to give a big 👍 to the idea of modification subselectors. Specifically, `state:modified.contents`. We have such a heavy reliance on environment variables in our dbt project that using `state:modified` is effectively a non-starter for us right now. Would love to be able to use it in the future though!
This definitely looks promising!
Up to now, the current `state:modified` feature only partially helps us because, for some of our dbt models, we rely a lot on variables and Jinja templating. Today, this forces us to always redeploy (i.e. `--full-refresh`) those models. And some can be costly because they materialize data.
The `state:modified.macros` and `state:modified.vars` subselectors (that would be included by default in `state:modified`) would be a great addition to solve this problem.