Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project

See original GitHub issue

Is this a new bug in dbt-core?

I believe this is a new bug in dbt-core
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

In case project has more than 2K-3K nodes(includes models, tests etc), dbt build taking time to start its first model. Just to build a single model itself takes 15 mins. Have disabled analytics tracking event thought that could have caused but still no luck. However, dbt run is faster but I can’t go with just run as tests required immediately before proceed to next model in DAG.

Expected Behavior

Models could be more depends project size but building single model not supposed to take 15 mins and it does not look realistic when we have multiple dbt commands to run.

Steps To Reproduce

version: dbt 1.0 till latest release 1.3

total nodes count: 3K+ including models, tests, snapshots, seeds

build single model: $dbt build -s test_model

Relevant log output

No response

Environment

- OS: Windows/DBT Cloud
- Python: 3.7.4
- dbt: 1.2

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

Issue Analytics

State:
Created a year ago
Comments:12 (6 by maintainers)

Top GitHub Comments

2reactions

china-csecommented, Oct 16, 2022

thank you @jtcohen6 for the response and references to ongoing discussions.

Have gone through code snippet where its taking time to build graph… I’m intrigued to see below block implementation where its removing all nodes/edges instead of just taking subgraph using networkx module. Is there any specific reason its implemented in such way using loops and going through each nodes check and remove.

As per my understanding of the same logic, it can be achieved through just using subgraph like below… It just faster and simpler code. have tested this, and it works faster and achieves same result. Would this be amended with logic or you see any issues?

Thank you!!!

1reaction

iknox-facommented, Oct 31, 2022

Hi @china-cse, thanks for the bugreport and the effort put into researching a solution! Unfortunately, in this case graph.subgraph doesn’t do exactly what we need it to. It removes all unconnected nodes after selection has occurred, whereas what we need to do is re-construct the original graph creating the edges between nodes that were already there.

As an example, if we applied the logic proposed here like so:

>>> import networkx as nx
>>> G = nx.path_graph(8)
>>> list(G.edges)
[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
>>> H = G.subgraph([1,2,3,5,7])
>>> list(H.edges)
[(1, 2), (2, 3)]

As you can see we’ve removed nodes 5 and 7 from our graph even though they were selected!

Here’s what we were expecting to happen:

>>> I = Graph(nx.DiGraph(G))
<dbt.graph.graph.Graph object at 0x10f3a9940>
>>> J = I.get_subset_graph([1,2,3,5,7])
>>> list(J.graph.edges)
[(1, 2), (2, 3), (3, 5), (5, 7)]

Now interestingly enough that’s not what we get today-- instead we get:

[(1, 2), (2, 1), (2, 3), (3, 2), (3, 5), (5, 3), (5, 7), (7, 5)]

As you can see, we have an extra set of edges being generated pointed in the opposite direction. This definitely represents a bug that I can try to take a closer look at tomorrow.

Also, as noted in the last time I worked on this code-- we might get a better result if we leveraged some DAG specific algo work.