Generalized `dbt build` command

See also: #1054, #1227, #2234, this comment

Describe the feature

Each dbt node-resource type has a task-command associated with it:

models = dbt run
tests = dbt test
seeds = dbt seed
snapshots = dbt snapshot
sources = dbt source snapshot-freshness

Additionally, there could be a generalized command dbt build¹ that would step through a DAG of multiple resource types and “build” them accordingly.

What would this look like? I imagine an argument syntax similar to dbt ls, i.e.

dbt build --select ... --exclude ... --resource-type ...

¹ name subject to change, though for the ultimate command of the data build tool, it’d be hard to think of one more apropos…

Example

Let’s imagine we had model_a that depends on a source (my_source.table) and a seed (my_seed), a snapshot (my_snapshot) of model_a, and then model_b which selected from my_snapshot. Of course, we also have tests on many of them. Roughly:

my_source.table --> my_seed --> model_a --> my_snapshot --> model_b

Within a single invocation, dbt build would go through motions analogous to running the following dbt commands. It would only proceed to the next numbered step if all upstream steps succeed:

1a. dbt seed my_seed
1b. dbt source snapshot-freshness --select my_source.table
2a. dbt test --models my_seed
2b. dbt test --models source:my_source.table
3. dbt run --models model_a
4. dbt test --models model_a
5. dbt snapshot --select my_snapshot
6. dbt test --models my_snapshot
7. dbt run --models model_b
8. dbt test --models model_b
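
The ordering above can be sketched as a topological walk over the example DAG. This is only an illustration, not how dbt implements it: the `UPSTREAM` and `RESOURCE_TYPE` dictionaries, the `build_plan` helper, and the resource-type labels are all hypothetical, and the edges follow the prose description (model_a depends on both the source and the seed). The sketch also pairs each node's build step directly with its test step instead of batching 1a/1b before 2a/2b.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy graph for the example: node -> set of upstream nodes.
UPSTREAM = {
    "source:my_source.table": set(),
    "my_seed": set(),
    "model_a": {"source:my_source.table", "my_seed"},
    "my_snapshot": {"model_a"},
    "model_b": {"my_snapshot"},
}

# Assumed resource-type label for each node (not dbt's internal names).
RESOURCE_TYPE = {
    "source:my_source.table": "source",
    "my_seed": "seed",
    "model_a": "model",
    "my_snapshot": "snapshot",
    "model_b": "model",
}

# The command that "builds" each resource type, as listed in the issue.
BUILD_COMMAND = {
    "model": "dbt run",
    "seed": "dbt seed",
    "snapshot": "dbt snapshot",
    "source": "dbt source snapshot-freshness",
}


def build_plan(upstream, resource_type):
    """Return (command, node) steps: build each node, then test it, in DAG order."""
    plan = []
    for node in TopologicalSorter(upstream).static_order():
        plan.append((BUILD_COMMAND[resource_type[node]], node))
        plan.append(("dbt test", node))
    return plan


if __name__ == "__main__":
    for command, node in build_plan(UPSTREAM, RESOURCE_TYPE):
        print(f"{command}: {node}")
```

The relative order of the seed and the source is not fixed, since neither depends on the other; everything downstream of them is.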

Complexities

Some of these tasks are already DAG aware (run, test, snapshot), some are not (seed, snapshot-freshness)
Commands support several different flags
- How to expose when a flag is being used, and when it isn’t?
- What about same-named flags that do subtly different things across commands? e.g. dbt run --full-refresh vs. dbt seed --full-refresh
Node types are just about 1:1 with task types, though dbt test almost feels like an exception. Technically, dbt test operates on test nodes, but other node types can be passed into its selection syntax, with selector expansion as the last step, so it “feels” like you’re testing a model or a snapshot. (Edit: this behavior may someday change.)
This risks a lot of our existing intuitions that come from having resource types nicely delineated. Put differently: what if it all just falls apart?
- What if it works so well that 90% of dbt deployments are just dbt build? Should we be wary of creating one command to rule them all?

Describe alternatives you’ve considered

Doing a more particularized version of this, e.g. dbt run+test (as outlined in linked issues)
Not doing this at all, and leaving the federation of one resource type = one command/invocation. Is this a good abstraction that we should fight to keep?

Who will this benefit?

Bigger, more complex projects that want to run subsets of different resource types. Today, that can only be accomplished through complex selection syntax leveraging tags. YAML selectors improve this somewhat, but they’re not the answer.
Projects with snapshots that participate in the middle of the DAG
Deployments that want to test upstream models before running downstream models, so as to alert earlier and save compute time/$$ in the event of failure

Issue Analytics

Created 3 years ago
Reactions: 10
Comments: 13 (8 by maintainers)

Top GitHub Comments

3 reactions

drewbanin commented, May 14, 2021

@jtcohen6 I have been stuck on this idea that I just cannot shake! Wanted to mention it here.

IF:

A project has sources configured AND
dbt is configured to run dbt source snapshot-freshness AND
dbt has a way to compare 1) freshness information and 2) model logic across invocations AND
dbt has knowledge of which materializations map to views vs. tables

THEN:

a generalized dbt build command would be well-positioned to skip running models where a rebuild would result in exactly the same database object that already exists in the database

I think there’s some more formality / rigor to apply here, and I’m actually not 100% sure that this requires the existence of a dbt build command, but wanted to throw it out there for consideration.

To get more concrete, here are some of the examples I’m considering:

A view model only needs to be built when:

Its logic has changed OR
it does not already exist in the database

A table/incremental model only needs to be built when:

Its logic has changed OR
it does not already exist in the database OR
An upstream model’s logic has changed OR
An upstream source’s ~logic~ data has changed
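
As a rough sketch, the two rule sets above could be written as a single predicate. The `NodeState` fields and the `needs_rebuild` function are hypothetical names invented for this illustration; dbt exposes nothing by these names, and how each flag would actually be computed (state comparison, freshness results) is left open.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical facts about one model, gathered across invocations."""
    materialized: str                          # "view", "table", or "incremental"
    logic_changed: bool = False                # the model's own SQL changed
    exists_in_database: bool = True            # the relation is already there
    upstream_logic_changed: bool = False       # some upstream model's SQL changed
    upstream_source_data_changed: bool = False # some upstream source has new data


def needs_rebuild(state: NodeState) -> bool:
    """Apply the view and table/incremental rules listed above."""
    own_reasons = state.logic_changed or not state.exists_in_database
    if state.materialized == "view":
        # A view always reflects current upstream data, so only its own
        # definition or absence forces a rebuild.
        return own_reasons
    # Tables and incremental models hold copied data, so upstream changes matter.
    return (
        own_reasons
        or state.upstream_logic_changed
        or state.upstream_source_data_changed
    )


if __name__ == "__main__":
    print(needs_rebuild(NodeState("view", upstream_source_data_changed=True)))
    print(needs_rebuild(NodeState("table", upstream_source_data_changed=True)))
```

Under these assumptions, fresh source data alone triggers a rebuild for the table but not for the view.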

I think that we can get at a lot of this stuff with the state: or config.materialized selectors, so really my thinking boils down to:

Something like a new freshness/source selector?
Maybe some sort of packaging that makes these types of selectors more concise?
A way to specify this logic as a part of the generalized dbt build command?

2 reactions

jtcohen6 commented, Jul 1, 2021

@kosti-hokkanen-supermetrics Cool to hear what you’re hoping to do with it! The first cut of dbt build won’t allow much configuration, and its behavior will be defined by some opinionated rules, including:

run a model before testing it
failures in those tests block downstream models from running
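
A toy simulation of those two rules, for illustration only: the `simulate_build` function, the status strings, and the example graph are hypothetical and do not reflect dbt's actual implementation or output. Each node is run and then tested; a node whose parent did not pass is skipped.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical linear graph: node -> set of upstream nodes.
EXAMPLE_UPSTREAM = {
    "model_a": set(),
    "my_snapshot": {"model_a"},
    "model_b": {"my_snapshot"},
}


def simulate_build(upstream, failing_tests):
    """Walk the DAG in dependency order.

    Returns {node: "pass" | "fail" | "skip"}, where "fail" means the node
    was run but its tests failed, and "skip" means a parent did not pass.
    """
    status = {}
    for node in TopologicalSorter(upstream).static_order():
        parents = upstream.get(node, ())
        if any(status[parent] != "pass" for parent in parents):
            status[node] = "skip"   # blocked by an upstream failure or skip
        elif node in failing_tests:
            status[node] = "fail"   # the node ran, then its tests failed
        else:
            status[node] = "pass"
    return status


if __name__ == "__main__":
    print(simulate_build(EXAMPLE_UPSTREAM, failing_tests={"model_a"}))
```

With a failing test on model_a, both my_snapshot and model_b are skipped, which is the early-alert behavior described in the issue.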

That said, I believe all the right constructs are there. I bet you can combine several dbt build invocations, paired with test severity and thoughtful node selection, to accomplish the thing you’re after.