[CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project
Is this a new bug in dbt-core?
- I believe this is a new bug in dbt-core
- I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
When a project has more than 2K-3K nodes (models, tests, etc.), dbt build takes a long time to start its first model. Building a single model takes 15 minutes. I disabled analytics event tracking, thinking that might be the cause, but it made no difference. dbt run is faster, but I can't rely on run alone because each model's tests must run immediately, before proceeding to the next model in the DAG.
Expected Behavior
Build time may grow with project size, but building a single model should not take 15 minutes. That is not realistic when we have multiple dbt commands to run.
Steps To Reproduce
version: dbt 1.0 through the latest release, 1.3
total node count: 3K+ including models, tests, snapshots, seeds
build single model: $ dbt build -s test_model
Relevant log output
No response
Environment
- OS: Windows/DBT Cloud
- Python: 3.7.4
- dbt: 1.2
Which database adapter are you using with dbt?
snowflake
Additional Context
No response
Issue Analytics
- Created a year ago
- Comments:12 (6 by maintainers)
Thank you @jtcohen6 for the response and the references to ongoing discussions.
I have gone through the code where building the graph takes the time. I'm intrigued by the implementation of the block below, which removes nodes and edges one at a time instead of taking a subgraph with the networkx module. Is there a specific reason it is implemented this way, looping over every node, checking it, and removing it?
As I understand the logic, the same result can be achieved with subgraph, as below. The code is simpler and faster; I have tested it, and it produces the same result. Could the logic be amended this way, or do you see any issues?
Thank you!!!
Hi @china-cse, thanks for the bug report and the effort put into researching a solution! Unfortunately, in this case graph.subgraph doesn't do exactly what we need it to. It removes all unconnected nodes after selection has occurred, whereas what we need to do is reconstruct the original graph, creating the edges between nodes that were already there.
As an example, if we applied the logic proposed here like so:
As you can see we’ve removed nodes 5 and 7 from our graph even though they were selected!
Here’s what we were expecting to happen:
Now, interestingly enough, that's not what we get today. Instead we get:
As you can see, we have an extra set of edges being generated, pointing in the opposite direction. This definitely represents a bug that I can try to take a closer look at tomorrow.
Also, as I noted the last time I worked on this code, we might get a better result if we leveraged some DAG-specific algorithm work.
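The difference between the two approaches discussed in this thread can be sketched in plain Python. This is only an illustration under stated assumptions: it is not dbt's actual implementation and it does not use the networkx API. The four-node chain and the function names induced_subgraph and remove_preserving_paths are hypothetical. The first function keeps only edges whose two endpoints were both selected; the second removes each unselected node and connects its parents directly to its children, so dependencies that ran through the removed node survive.

```python
from itertools import product


def induced_subgraph(edges, selected):
    """Keep only the edges whose endpoints are both selected."""
    selected = set(selected)
    return {(u, v) for u, v in edges if u in selected and v in selected}


def remove_preserving_paths(edges, selected):
    """Drop unselected nodes, reconnecting each one's parents to its children."""
    edges = set(edges)
    nodes = {n for edge in edges for n in edge}
    for node in sorted(nodes - set(selected)):
        parents = {u for u, v in edges if v == node}
        children = {v for u, v in edges if u == node}
        # Remove every edge touching the node, then bridge across the gap.
        edges = {(u, v) for u, v in edges if node not in (u, v)}
        edges |= set(product(parents, children))
    return edges


# Hypothetical DAG: 1 -> 2 -> 3 -> 4, with node 2 left out of the selection.
edges = {(1, 2), (2, 3), (3, 4)}
selected = {1, 3, 4}

induced = induced_subgraph(edges, selected)
rebuilt = remove_preserving_paths(edges, selected)

print(sorted(induced))  # [(3, 4)]
print(sorted(rebuilt))  # [(1, 3), (3, 4)]
```

In this toy case the induced subgraph loses the fact that node 3 depends on node 1, while the edge-preserving removal keeps it as a direct edge. The per-node loop is also where the cost comes from: each removal can add up to parents × children new edges, which may help explain why the step becomes slow on graphs with thousands of nodes.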