User-defined default selection criteria

Credit where due: This was @aescay’s idea many months ago!

Describe the feature

We’ve increasingly heard a desire from community members to limit dbt run in development to only a subset of their project, or to always exclude a set of infrequently run models. Trying to solve for this with enabled gets tricky quickly, because disabled resources cannot participate in the DAG and will raise dependency errors accordingly.

Instead, what if there were a way to redefine the default selection criteria? I think the right construct for this is a yaml selector (as in selectors.yml). We could support a new selector property, default: true|false, or check for a selector named default. The former is almost certainly better; we’d just need to raise an error if multiple selectors have default: true.
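A minimal sketch of the "at most one default" check, assuming the selectors have already been loaded from selectors.yml into plain dictionaries and that each default value is already a boolean. The helper name find_default_selector_name and the use of ValueError are hypothetical, chosen for illustration; this is not dbt's actual API.

```python
from typing import Any, Dict, List, Optional


def find_default_selector_name(selectors: List[Dict[str, Any]]) -> Optional[str]:
    """Return the name of the selector marked default, or None if there is none.

    Hypothetical helper: raises ValueError when more than one selector
    sets default: true, since only one can be the project default.
    """
    defaults = [s["name"] for s in selectors if s.get("default", False) is True]
    if len(defaults) > 1:
        raise ValueError(
            "Found multiple selectors with default: true: " + ", ".join(defaults)
        )
    return defaults[0] if defaults else None


# Illustrative input mirroring the shape of a parsed selectors.yml.
selectors = [
    {"name": "prod", "default": False, "definition": "fqn:* source:* exposure:*"},
    {"name": "dev", "default": True, "definition": "fqn:* source:* exposure:*"},
]
default_name = find_default_selector_name(selectors)
```

Returning None when no selector is marked keeps the common case (no default defined) distinct from the error case.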

If specified, that selector should override the default includes + excludes defined here:

https://github.com/fishtown-analytics/dbt/blob/abe8e839458d0288f76cb6950e79c77e3b1627cc/core/dbt/graph/cli.py#L22-L23

Then, dbt run would have the same effect as dbt run --selector my_default_selector.

Questions

How to combine the user-defined default with additional selection criteria passed via CLI syntax? Should the two be combined, or should the CLI criteria entirely override the default? I’m leaning toward total override for two reasons:
- It’s what happens today: dbt ls really means dbt ls --select fqn:* source:* exposure:*. As soon as the user says -s something_else, the command becomes dbt ls -s something_else.
- Override feels like the only way to make sense of dbt ls --selector not_the_default.

Can we support different default selectors in different environments? I think it makes a lot of sense to perma-exclude certain models in development, but still select them in production. As it turns out, selectors.yml already supports Jinja today, so this could be as simple as:

selectors:
  - name: prod
    description: Select everything in prod
    default: "{{ target.name == 'prod' | as_bool }}"
    definition: 'fqn:* source:* exposure:*'
  - name: dev
    description: Avoid unpleasant surprises in dev
    default: "{{ target.name == 'dev' | as_bool }}"
    definition:
      union:
        - 'fqn:* source:* exposure:*'
        - exclude:
            - unpleasant
            - surprises

Issue Analytics

State:
Created 2 years ago
Reactions: 3
Comments: 14 (14 by maintainers)

Top GitHub Comments

1 reaction

jtcohen6 commented, Aug 23, 2021

@TeddyCr Nice progress so far! Sorry I was out last week, so just responding now. I have a few quick comments:

where to get selector info?

I think you want to use self.selectors rather than self.manifest_selectors (which has already been serialized + expanded, to print a user-friendly version in manifest.json). In order to check self.selectors for the default selector, you’ll need to include default: True|False in SelectorConfig, and also the output of the parse_from_selectors_definition method, since they currently just include the name and definition of each selector.

I think there are a few different ways to go about this. The one I can think of is to change SelectorConfig from being just Dict[str, SelectionSpec] to instead being Dict[str, Dict[str, Union[SelectionSpec, bool]]].

For the example selectors in the original comment, instead of:

{'prod': SelectionCriteria(...), 'dev': <dbt.graph.selector_spec.SelectionDifference object at 0x105eb6400>}

They would be represented as:

{'prod': {'default': False, 'definition': SelectionCriteria(...)}, 'dev': {'default': True, 'definition': <dbt.graph.selector_spec.SelectionDifference object at 0x105eb6400>}}

Then get_selector would return self.selectors[name]['definition'], instead of just self.selectors[name].

Alternatively, you could grab the name of the default selector as a pointer earlier on, and store that somewhere. I think you’ll need to alter the selectors config object one way or another.
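To make the nested structure concrete, here is a rough sketch of a config object that stores both the default flag and the definition per selector. The class name ProjectConfigSketch is hypothetical, and plain strings stand in for the real SelectionSpec objects; the method names get_selector and get_default_selector follow the ones used in this thread, but the bodies are an assumption, not dbt's implementation.

```python
from typing import Any, Dict, Optional


class ProjectConfigSketch:
    """Hypothetical stand-in for the project config holding parsed selectors."""

    def __init__(self, selectors: Dict[str, Dict[str, Any]]) -> None:
        # Maps selector name -> {'default': bool, 'definition': spec}
        self.selectors = selectors

    def get_selector(self, name: str) -> Any:
        """Return only the definition, so existing callers keep working."""
        if name not in self.selectors:
            raise KeyError(f"Could not find selector named {name}")
        return self.selectors[name]["definition"]

    def get_default_selector(self) -> Optional[Any]:
        """Return the default selector's definition, or None if none is marked."""
        for entry in self.selectors.values():
            if entry["default"] is True:
                return entry["definition"]
        return None


config = ProjectConfigSketch(
    {
        "prod": {"default": False, "definition": "prod-spec"},  # placeholder spec
        "dev": {"default": True, "definition": "dev-spec"},  # placeholder spec
    }
)
```

Because get_default_selector returns None when nothing is marked, callers can check for a default before deciding whether to fall back to the CLI arguments.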

order of operations

Once you can reliably access the default selector definition from the project config, I think you want to slightly adjust the order of the logic you’ve got above, so that you do not use the default selector if the user has passed --models or --exclude. How about something like:

    def get_selection_spec(self) -> SelectionSpec:
        # get this first, so we can check to see if it exists in the elif condition below
        default_selector = self.config.get_default_selector()
        logger.info(default_selector)  # TODO remove after debugging
        if self.args.selector_name:
            spec = self.config.get_selector(self.args.selector_name)
        # do not use default selector if the user has passed --models or --exclude,
        # or if it is not defined
        elif not (self.args.models or self.args.exclude) and default_selector:
            spec = default_selector
        else:
            spec = parse_difference(self.args.models, self.args.exclude)
        return spec
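The precedence in the method above can be exercised in isolation. The following is a standalone sketch with stubbed inputs: resolve_selection, the lambda stubs, and the string and tuple "specs" are all hypothetical placeholders for the task's real args, config, and parse_difference.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional


def resolve_selection(
    args: Any,
    get_selector: Callable[[str], Any],
    default_selector: Optional[Any],
    parse_difference: Callable[[Any, Any], Any],
) -> Any:
    """Pick a selection spec using the order of operations discussed above."""
    # 1. An explicit --selector always wins.
    if args.selector_name:
        return get_selector(args.selector_name)
    # 2. Use the default only if it exists and no --models/--exclude was passed.
    if default_selector is not None and not (args.models or args.exclude):
        return default_selector
    # 3. Otherwise fall back to the CLI criteria (or the built-in defaults).
    return parse_difference(args.models, args.exclude)


# Stubs standing in for the real config lookup and parser.
stub_get_selector = lambda name: f"selector:{name}"
stub_parse_difference = lambda models, exclude: ("cli", models, exclude)

explicit = resolve_selection(
    SimpleNamespace(selector_name="prod", models=None, exclude=None),
    stub_get_selector, "default-spec", stub_parse_difference,
)
implicit = resolve_selection(
    SimpleNamespace(selector_name=None, models=None, exclude=None),
    stub_get_selector, "default-spec", stub_parse_difference,
)
overridden = resolve_selection(
    SimpleNamespace(selector_name=None, models=["my_model"], exclude=None),
    stub_get_selector, "default-spec", stub_parse_difference,
)
```

Here explicit resolves to the named selector, implicit to the default, and overridden to the CLI criteria, matching the "total override" behavior proposed in the issue.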

testing

The relevant existing unit tests are in test_graph_selector_parsing. If you make changes to the structure of SelectorConfig/selectors as I mentioned above, it’s likely you’ll need to adjust this and a few other tests that mock what selectors look like.

I also think we’ll want to add an integration test, to make sure this works end-to-end!

Let me know if you find the comments above helpful, and if you’re able to give it another go 😃

1 reaction

jtcohen6 commented, Aug 3, 2021

@TeddyCr I had been thinking that the change here should be in graph/cli.py, and that the approach should be to check selectors and, if a default selector is found, override the default includes + excludes, which are used as inputs to parse_difference lower down.

The way this works in practice is that a task calls get_selection_spec. First it checks to see if a --selector was passed; otherwise, it passes --models/--select and --exclude into parse_difference:

https://github.com/dbt-labs/dbt/blob/45fe76eef4b4b82ae1442f00310e0c6a121774f2/core/dbt/task/list.py#L171-L176

https://github.com/dbt-labs/dbt/blob/45fe76eef4b4b82ae1442f00310e0c6a121774f2/core/dbt/task/compile.py#L40-L45

https://github.com/dbt-labs/dbt/blob/45fe76eef4b4b82ae1442f00310e0c6a121774f2/core/dbt/task/freshness.py#L139-L150

So now I’m thinking a better approach here may be to add a check within get_selection_spec. If none of --selector, --models/--select, and --exclude is set, then call self.config.get_selector('default'). Here’s some code for task/list.py:

    def get_selection_spec(self) -> SelectionSpec:
        if self.args.selector_name:
            spec = self.config.get_selector(self.args.selector_name)
        elif not (self.args.models or self.args.exclude):
            # grab by new 'default' property rather than name
            # need some way to check that a default selector is defined!
            spec = self.config.get_selector("default")
        else:
            spec = parse_difference(self.args.models, self.args.exclude)
        return spec

This works for a selector named default! But let’s actually do this by defining a new selector property, default: true|false, and adding that as an argument to get_selector. We’ll also want to handle the (very common) case in which a default selector is not defined, by having some way to check for it first.