Generalized `dbt build` command

See original GitHub issue

See also: #1054, #1227, #2234, this comment

Describe the feature

Each dbt node-resource type has a task-command associated with it:

  • models = dbt run
  • tests = dbt test
  • seeds = dbt seed
  • snapshots = dbt snapshot
  • sources = dbt source snapshot-freshness

Additionally, there could be a generalized command, dbt build [1], that would step through a DAG of multiple resource types and “build” them accordingly.

What would this look like? I imagine an argument syntax similar to dbt ls, i.e.

dbt build --select ... --exclude ... --resource-type ...

[1] Name subject to change, though for the ultimate command of the data build tool, it’d be hard to think of one more apropos…

Example

Let’s imagine we had model_a that depends on a source (my_source.table) and a seed (my_seed), a snapshot (my_snapshot) of model_a, and then model_b which selected from my_snapshot. Of course, we also have tests on many of them. Roughly:

my_source.table --+
                  +--> model_a --> my_snapshot --> model_b
my_seed ----------+

Within a single invocation, dbt build would go through motions analogous to running the following dbt commands. It would proceed to the next numbered step only if all upstream steps succeed:

1a. dbt seed my_seed
1b. dbt source snapshot-freshness --select my_source.table
2a. dbt test --models my_seed
2b. dbt test --models source:my_source.table
3. dbt run --models model_a
4. dbt test --models model_a
5. dbt snapshot --select my_snapshot
6. dbt test --models my_snapshot
7. dbt run --models model_b
8. dbt test --models model_b
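The ordering above can be sketched as a topological walk over the example DAG. This is only an illustration, not dbt's implementation: the `UPSTREAM` and `RESOURCE_TYPE` dictionaries, the `ACTION` labels, and the `plan_build` helper are all hypothetical names introduced here. The sketch also emits each node's tests immediately after the node itself, which is one valid ordering but interleaves steps 1a/2a and 1b/2b rather than grouping them as in the list.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical in-memory DAG for the example: node -> set of upstream nodes.
UPSTREAM = {
    "my_seed": set(),
    "source:my_source.table": set(),
    "model_a": {"my_seed", "source:my_source.table"},
    "my_snapshot": {"model_a"},
    "model_b": {"my_snapshot"},
}

# Hypothetical mapping of each node to its resource type.
RESOURCE_TYPE = {
    "my_seed": "seed",
    "source:my_source.table": "source",
    "model_a": "model",
    "my_snapshot": "snapshot",
    "model_b": "model",
}

# Illustrative labels for the per-resource-type task (not real CLI strings).
ACTION = {
    "seed": "seed",
    "source": "freshness",
    "model": "run",
    "snapshot": "snapshot",
}


def plan_build(upstream, resource_type):
    """Return (action, node) steps in an order that respects the DAG.

    Each node is built first and tested right afterwards, so a node's
    tests always come before anything downstream of it.
    """
    steps = []
    for node in TopologicalSorter(upstream).static_order():
        steps.append((ACTION[resource_type[node]], node))
        steps.append(("test", node))
    return steps


if __name__ == "__main__":
    for action, node in plan_build(UPSTREAM, RESOURCE_TYPE):
        print(f"{action:10s} {node}")
```

A real implementation would also have to stop or skip downstream steps when a step fails; this sketch only covers the ordering.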

Complexities

  • Some of these tasks are already DAG aware (run, test, snapshot), some are not (seed, snapshot-freshness)
  • Commands support several different flags
    • How to expose when a flag is being used, and when it isn’t?
    • What about same-named flags that do subtly different things across commands? e.g. dbt run --full-refresh vs. dbt seed --full-refresh
  • Node types are just about 1:1 with task types, though dbt test almost feels like an exception. Technically, dbt test operates on test nodes, but other node types can be passed into its selection syntax, with selector expansion as the last step, so it “feels” like you’re testing a model or a snapshot. (Edit: this behavior may someday change.)
  • This risks a lot of our existing intuitions that come from having resource types nicely delineated. Put differently: what if it all just falls apart?
    • What if it works so well that 90% of dbt deployments are just dbt build? Should we be wary of creating one command to rule them all?

Describe alternatives you’ve considered

  • Doing a more particularized version of this, e.g. dbt run+test (as outlined in linked issues)
  • Not doing this at all, and leaving the federation of one resource type = one command/invocation. Is this a good abstraction that we should fight to keep?

Who will this benefit?

  • Bigger, more complex projects that want to run subsets of different resource types. Today, that can only be accomplished through complex selection syntax leveraging tags. YAML selectors improve this somewhat, but they’re not the answer.
  • Projects with snapshots that participate in the middle of the DAG
  • Deployments that want to test upstream models before running downstream models, so as to alert earlier and save compute time/$$ in the event of failure

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 10
  • Comments: 13 (8 by maintainers)

Top GitHub Comments

3 reactions
drewbanin commented, May 14, 2021

@jtcohen6 I have been stuck on this idea that I just cannot shake! Wanted to mention it here.

IF:

  • A project has sources configured AND
  • dbt is configured to run dbt source snapshot-freshness AND
  • dbt has a way to compare 1) freshness information and 2) model logic across invocations AND
  • dbt has knowledge of which materializations map to views vs. tables

THEN:

  • a generalized dbt build command would be well-positioned to skip running models where a rebuild would result in exactly the same database object that already exists in the database

I think there’s some more formality / rigor to apply here, and I’m actually not 100% sure that this requires the existence of a dbt build command, but wanted to throw it out there for consideration.

To get more concrete, here are some of the examples I’m considering:

A view model only needs to be built when:

  • its logic has changed
  • it does not already exist in the database

A table/incremental model only needs to be built when:

  • Its logic has changed OR
  • it does not already exist in the database OR
  • An upstream model’s logic has changed OR
  • An upstream source’s ~logic~ data has changed

I think that we can get at a lot of this stuff with the state: or config.materialized selectors, so really my thinking boils down to:

  • Something like a new freshness/source selector?
  • Maybe some sort of packaging that makes these types of selectors more concise?
  • A way to specify this logic as a part of the generalized dbt build command?
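
The rebuild conditions listed in this comment can be written down as a small predicate. This is a sketch under stated assumptions, not dbt behavior: `NodeState`, its fields, and `needs_rebuild` are hypothetical names, and the two view conditions are assumed to be joined by OR, as they are for tables.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical snapshot of what is known about one model between invocations."""

    materialized: str  # "view", "table", or "incremental"
    logic_changed: bool
    exists_in_database: bool
    upstream_logic_changed: bool = False
    upstream_source_data_changed: bool = False


def needs_rebuild(state: NodeState) -> bool:
    """Apply the rules from the comment: views depend only on their own state,
    tables and incremental models also depend on upstream changes."""
    # Shared by every materialization: changed logic or a missing object.
    if state.logic_changed or not state.exists_in_database:
        return True
    # A view is resolved at query time, so upstream changes do not force a rebuild.
    if state.materialized == "view":
        return False
    # Tables and incremental models hold data, so upstream changes matter.
    return state.upstream_logic_changed or state.upstream_source_data_changed


if __name__ == "__main__":
    unchanged_view = NodeState("view", False, True, upstream_source_data_changed=True)
    stale_table = NodeState("table", False, True, upstream_source_data_changed=True)
    print(needs_rebuild(unchanged_view))
    print(needs_rebuild(stale_table))
```

Knowing whether upstream source data has changed is the part that would rely on freshness information being comparable across invocations, which the comment lists as a precondition.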

2 reactions
jtcohen6 commented, Jul 1, 2021

@kosti-hokkanen-supermetrics Cool to hear what you’re hoping to do with it! The first cut of dbt build won’t allow much configuration, and its behavior will be defined by some opinionated rules, including:

  • run a model before testing it
  • failures in those tests block downstream models from running

That said, I believe all the right constructs are there. I bet you can combine several dbt build invocations, paired with test severity and thoughtful node selection, to accomplish the thing you’re after.
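
The two rules named in this comment (run a model before testing it, and let test failures block downstream models) can be simulated with a few lines. This is a toy model rather than dbt's scheduler: `simulate_build`, the status strings, and the `failing_tests` argument are hypothetical, and test severity is not modeled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def simulate_build(upstream, failing_tests):
    """Walk the DAG in dependency order and return a status per node.

    upstream: dict mapping node -> set of upstream nodes (hypothetical format).
    failing_tests: set of nodes whose tests fail after the node is built.
    A node is "skip" when any upstream node did not pass.
    """
    status = {}
    for node in TopologicalSorter(upstream).static_order():
        parents = upstream.get(node, ())
        if any(status[parent] != "pass" for parent in parents):
            status[node] = "skip"  # blocked by an upstream failure or skip
        elif node in failing_tests:
            status[node] = "fail"  # the node was built, but its tests failed
        else:
            status[node] = "pass"
    return status


if __name__ == "__main__":
    dag = {
        "my_seed": set(),
        "model_a": {"my_seed"},
        "my_snapshot": {"model_a"},
        "model_b": {"my_snapshot"},
    }
    print(simulate_build(dag, failing_tests={"model_a"}))
```

With a failing test on model_a, everything downstream of it is skipped while my_seed still passes, which is the "alert earlier and save compute" behavior the issue asks for.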
