[CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When a project has more than 2K-3K nodes (including models, tests, etc.), dbt build takes a long time to start its first model. Building a single model takes 15 minutes. I disabled the analytics tracking event, thinking it could be the cause, but it made no difference. dbt run is faster, but I can’t rely on run alone because the tests are required immediately after each model, before proceeding to the next model in the DAG.

Expected Behavior

Build time may reasonably depend on project size, but building a single model should not take 15 minutes, and that is not practical when we have multiple dbt commands to run.

Steps To Reproduce

version: dbt 1.0 through the latest release, 1.3

total node count: 3K+, including models, tests, snapshots, seeds

build a single model: $ dbt build -s test_model

Relevant log output

No response

Environment

- OS: Windows/DBT Cloud
- Python: 3.7.4
- dbt: 1.2

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

2 reactions
china-cse commented, Oct 16, 2022

Thank you @jtcohen6 for the response and the references to ongoing discussions.

I have gone through the code where it takes time to build the graph. I’m intrigued by the implementation of the block below, which removes nodes and edges one at a time instead of just taking a subgraph using the networkx module. Is there a specific reason it is implemented this way, looping through every node to check and remove it?

image

As I understand the logic, the same result can be achieved by just using subgraph, as below. The code is faster and simpler. I have tested this, and it runs faster and achieves the same result. Could the logic be amended this way, or do you see any issues?

image

Thank you!!!

1 reaction
iknox-fa commented, Oct 31, 2022

Hi @china-cse, thanks for the bug report and the effort put into researching a solution! Unfortunately, in this case graph.subgraph doesn’t do exactly what we need it to. It keeps only the edges whose two endpoints were both selected, so selected nodes that were linked through an unselected node end up disconnected, whereas what we need to do is re-construct the original graph, re-creating the connections between the selected nodes that were already there.

As an example, if we applied the logic proposed here like so:

>>> import networkx as nx
>>> G = nx.path_graph(8)
>>> list(G.edges)
[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
>>> H = G.subgraph([1,2,3,5,7])
>>> list(H.edges)
[(1, 2), (2, 3)]

As you can see, nodes 5 and 7 no longer appear in any edge of our graph, even though they were selected!
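A small check of what subgraph keeps, using the same path graph as above. It is only a sketch for inspecting the result: the selected nodes all stay in the node set, but 5 and 7 come out isolated because the nodes that linked them (4 and 6) were not selected. `nx.isolates` is a standard networkx helper; nothing here is dbt code.

```python
import networkx as nx

G = nx.path_graph(8)                  # undirected chain 0-1-2-...-7
H = G.subgraph([1, 2, 3, 5, 7])       # induced subgraph on the selection

selected_nodes = sorted(H.nodes)      # every selected node is still present
kept_edges = list(H.edges)            # only edges with both endpoints selected
isolated = sorted(nx.isolates(H))     # selected nodes left without any edge

print(selected_nodes)
print(kept_edges)
print(isolated)
```

For dbt this matters because an isolated node has lost its ordering relative to the rest of the selection, so nothing would force node 5 to wait for node 3.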

Here’s what we were expecting to happen:

>>> from dbt.graph.graph import Graph
>>> I = Graph(nx.DiGraph(G))
>>> I
<dbt.graph.graph.Graph object at 0x10f3a9940>
>>> J = I.get_subset_graph([1,2,3,5,7])
>>> list(J.graph.edges)
[(1, 2), (2, 3), (3, 5), (5, 7)]
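
The expected result above can be reproduced with plain networkx. The sketch below is an assumption about the general technique (remove each unselected node after wiring its parents to its children), not a copy of dbt's `get_subset_graph`; the helper name `subset_keeping_paths` is hypothetical.

```python
import networkx as nx


def subset_keeping_paths(graph, selected):
    """Return a copy of `graph` restricted to `selected`, bridging removed nodes.

    Hypothetical helper for illustration; not dbt's implementation.
    """
    selected = set(selected)
    result = nx.DiGraph(graph)  # work on a copy, leave the input untouched
    for node in list(result.nodes):
        if node in selected:
            continue
        parents = list(result.predecessors(node))
        children = list(result.successors(node))
        # Connect every parent to every child so ordering survives the removal.
        result.add_edges_from((p, c) for p in parents for c in children if p != c)
        result.remove_node(node)
    return result


# Directed chain 0 -> 1 -> ... -> 7, built as a DiGraph from the start.
G_directed = nx.path_graph(8, create_using=nx.DiGraph)
J = subset_keeping_paths(G_directed, [1, 2, 3, 5, 7])
print(sorted(J.edges))
```

The parent-by-child loop is also where the cost comes from: a removed node with many parents and many children adds one edge per pair, which is one plausible reason this step gets slow on projects with thousands of nodes.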

Now, interestingly enough, that’s not what we get today. Instead we get:

[(1, 2), (2, 1), (2, 3), (3, 2), (3, 5), (5, 3), (5, 7), (7, 5)]

As you can see, we have an extra set of edges being generated pointed in the opposite direction. This definitely represents a bug that I can try to take a closer look at tomorrow.
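One detail of this particular reproduction may be worth ruling out first. This is an observation about networkx, not a diagnosis of dbt: `nx.path_graph(8)` is undirected, and converting an undirected graph with `nx.DiGraph(...)` stores each edge in both directions, so reverse edges exist before any selection logic runs.

```python
import networkx as nx

undirected = nx.path_graph(3)                 # 0 - 1 - 2, no direction
converted = nx.DiGraph(undirected)            # each edge becomes two arcs
converted_edges = sorted(converted.edges)

# Building the chain as a DiGraph from the start gives one arc per edge.
directed = nx.path_graph(3, create_using=nx.DiGraph)
directed_edges = sorted(directed.edges)

print(converted_edges)
print(directed_edges)
```

Re-running the comparison on a graph built with `create_using=nx.DiGraph` would show whether the doubled edges come from the test setup or from the subset logic itself.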

Also, as noted the last time I worked on this code, we might get a better result if we leveraged some DAG-specific algorithm work.
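
As one possible reading of "DAG-specific", and only as a sketch, networkx ships DAG-only routines that can express the same subset: take the transitive closure, keep the selected nodes, then reduce. This is not what dbt does, and the closure can hold a number of edges quadratic in the node count, so it illustrates the idea rather than a proven speedup.

```python
import networkx as nx

dag = nx.path_graph(8, create_using=nx.DiGraph)   # 0 -> 1 -> ... -> 7
selection = [1, 2, 3, 5, 7]

# 1. Closure: add an edge u -> v wherever v is reachable from u.
closure = nx.transitive_closure_dag(dag)
# 2. Induced subgraph: reachability between selected nodes is now explicit.
reachable = closure.subgraph(selection)
# 3. Reduction: drop edges already implied by longer paths.
reduced = nx.transitive_reduction(reachable)

print(sorted(reduced.edges))
```

Unlike bridging parents to children node by node, the reduction step also drops redundant edges, so the result can contain fewer edges while preserving the same ordering between selected nodes.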
