
[Performance] NUM_OMP_THREADS can have a significant impact on memory usage for large graphs (billions of nodes, edges)

See original GitHub issue

šŸ› Bug

This isn’t necessarily a bug, but more of a tricky observation:

I’ve been stress-testing DGL’s ability to perform random walks on large graphs (e.g. 1B nodes, 1B edges), and I’ve found that NUM_OMP_THREADS can have a significant impact on memory usage: roughly a 10x difference in peak memory between explicitly setting NUM_OMP_THREADS=1 and leaving it at its default value.
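One practical note for reproducing the single-thread numbers: the standard OpenMP environment variable is spelled OMP_NUM_THREADS, and it is read once when the OpenMP runtime initializes, so it has to be set before the first import of a library that links OpenMP. A minimal sketch:

```python
import os

# Cap the OpenMP thread pool. This must happen before the first import of a
# library that initializes OpenMP (e.g. torch or dgl), because standard OpenMP
# runtimes read OMP_NUM_THREADS only once, at startup.
os.environ["OMP_NUM_THREADS"] = "1"
```

Setting it from the shell (`OMP_NUM_THREADS=1 python script.py`) achieves the same thing.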

I suspect that the root cause has to do with this COOToCSR() function (which random_walk() calls, as it evidently needs the graph to be in CSR format): https://github.com/dmlc/dgl/blob/master/src/array/cpu/spmat_op_impl_coo.cc#L301

Note this important comment:

// complexity: time O(NNZ), space O(1) if the coo is row sorted,
// time O(NNZ/p + N), space O(NNZ + N*p) otherwise, where p is the number of
// threads.

Here, NNZ is the "number of graph edges", and N is the "number of graph nodes".

When the graph is large (e.g. billions of nodes), that N*p space usage can be brutal, especially if p (the number of OMP threads) is high. As one data point, for my graph with 1B nodes and 100M edges, calling random_walk() resulted in the following peak memory usage:

# explicitly NUM_OMP_THREADS=1
CPU Memory used to store graph (COO+CSR): 17.21 GB
    Peak CPU Memory: 20.72 GB

# leave omp num threads at its default (whatever that is)
CPU Memory used to store graph (COO+CSR): 17.21 GB
    Peak CPU Memory: 204.31 GB

Here, the high "Peak CPU Memory" is (I’m 99% sure) from the memory overhead of doing COOToCSR().
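To see where the N*p term comes from: the serial CSR construction needs only a single length-N counter array, while the parallel path (per the comment quoted above) effectively pays for one counter per node per thread. Below is a sketch of the serial counting pass, plus back-of-envelope arithmetic; the thread count of 24 is my assumption for illustration, not a number from the issue:

```python
def coo_rowptr(rows, num_nodes):
    """Serial counting pass: one length-(N+1) array, independent of thread count."""
    indptr = [0] * (num_nodes + 1)
    for r in rows:
        indptr[r + 1] += 1          # count edges per source node
    for i in range(num_nodes):      # exclusive prefix sum -> CSR row pointers
        indptr[i + 1] += indptr[i]
    return indptr

# Back-of-envelope for the N*p scratch term: with N = 1e9 nodes, 8-byte
# counters, and (hypothetically) p = 24 threads, the scratch space alone is
# about 192 GB -- the same order of magnitude as the ~204 GB peak above.
scratch_gb = 1e9 * 8 * 24 / 1e9
```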

To Reproduce

I don’t have a repro script prepared yet, but I can put one together if people are interested. The methodology is fairly simple: create a random heterograph with two node types and one edge type via dgl.heterograph(data_dict), then call dgl.sampling.random_walk() on it.
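A rough sketch along those lines; the node/edge-type names and sizes here are made up for illustration (scale them up to reach the regime described above), and the DGL calls are guarded so the data-generation part runs standalone:

```python
import random

def make_coo(num_src, num_dst, num_edges, seed=0):
    """Random COO edge list for a bipartite ('user' -> 'item') toy graph."""
    rng = random.Random(seed)
    src = [rng.randrange(num_src) for _ in range(num_edges)]
    dst = [rng.randrange(num_dst) for _ in range(num_edges)]
    return src, dst

if __name__ == "__main__":
    src, dst = make_coo(num_src=1_000, num_dst=1_000, num_edges=5_000)
    try:
        import dgl
        import torch
        g = dgl.heterograph(
            {("user", "clicks", "item"): (torch.tensor(src), torch.tensor(dst))}
        )
        # random_walk triggers the COO->CSR conversion discussed above
        traces, types = dgl.sampling.random_walk(
            g, torch.arange(100), metapath=["clicks"]
        )
    except ImportError:
        pass  # dgl/torch not installed; the COO generator above still runs
```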

Expected behavior

This isn’t a failure/bug of the DGL system, but perhaps the documentation could be made clearer about the impact of NUM_OMP_THREADS on memory usage for large graphs, and about the importance of sorting the COO rows prior to calling COOToCSR().

Making scalable systems is always tricky, so maybe this ticket will inspire some ideas. Thanks!

Environment

  • DGL Version (e.g., 1.0): A recent nightly build
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): Nightly build
  • Build command you used (if compiling from source):
  • Python version: 3.7
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
Rhett-Ying commented on Sep 6, 2021

Awesome idea @nv-dlasalle.

In Step 6, I think we should compute the prefix sum of P_sum starting from the top-left, walking down each column and carrying the last sum of the previous column into the current column. The t*N/P term seems inaccurate, as the count for each row index varies.

In Step 8, each thread t can now construct the CSR rows V[t] for the edges in E_sorted[P_sum[0,t]:P_sum[0,t+1]].

1 reaction
nv-dlasalle commented on Sep 2, 2021

Sorry, in this context M is the number of edges of the graph, so in COO format the memory foot-print is 2M, and in CSR it is N+M.

For the algorithm sketch below, P will be the number of threads (this should have parallel time complexity O(M + N + P) and memory complexity O(M + N + P^2)):

1. A global array of length M is allocated, to store the bucket sorted edges (E_sorted).
2. A global 2d array of dimension P x P is allocated (P_sum).
3. Each thread t is assigned a contiguous chunk of M/P edges (E[t]).
4. Each thread t is assigned a contiguous chunk of N/P nodes (V[t]).
5. Each thread t populates the row P_sum[t,*] with counts of the source nodes of the edges in E[t], such that P_sum[t][i] is the number of edges in E[t] with source nodes in V[i].
6. Each thread t computes a prefix sum of the column P_sum[*,t], and adds t*N/P to each entry, such that each entry P_sum[i,j] is the offset into E_sorted of the bucket that thread i should insert all of its edges E[i] with source nodes in V[j].
7. Each thread copies its E[t] to the buckets in E_sorted based on the offsets in P_sum[t,*].
8. Each thread t can now construct the CSR rows V[t] for the edges in E_sorted[P_sum[t,0]:P_sum[t+1,0]], using essentially the serial algorithm for each thread (but offsetting the rowptr values appropriately).
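Pieced together, the steps above can be sketched in plain Python, with the P threads simulated by ordinary loops (so this shows the data movement, not the speed). The bucket offsets are computed as a column-major exclusive prefix sum over P_sum, per Rhett-Ying’s correction above; names like `chunk_of` are mine, not from any DGL code:

```python
from bisect import bisect_right

def parallel_coo_to_csr(src, dst, num_nodes, num_threads):
    """Bucket-sort COO->CSR following the 8 steps; threads simulated by loops."""
    M, N, P = len(src), num_nodes, num_threads
    ebounds = [t * M // P for t in range(P + 1)]  # steps 1-3: edge chunks E[t]
    vbounds = [t * N // P for t in range(P + 1)]  # step 4: node chunks V[j]

    def chunk_of(v):                              # which node chunk v falls in
        return bisect_right(vbounds, v) - 1

    # Step 5: P_sum[t][j] = number of edges in E[t] with source in V[j].
    P_sum = [[0] * P for _ in range(P)]
    for t in range(P):
        for e in range(ebounds[t], ebounds[t + 1]):
            P_sum[t][chunk_of(src[e])] += 1

    # Step 6: column-major exclusive prefix sum (walk down each column,
    # carrying the previous column's total) -> offset of bucket (t, j).
    offset = [[0] * P for _ in range(P)]
    running = 0
    for j in range(P):
        for t in range(P):
            offset[t][j] = running
            running += P_sum[t][j]

    # Step 7: each thread scatters its edges into its buckets of E_sorted.
    E_sorted = [None] * M
    cursor = [row[:] for row in offset]
    for t in range(P):
        for e in range(ebounds[t], ebounds[t + 1]):
            j = chunk_of(src[e])
            E_sorted[cursor[t][j]] = (src[e], dst[e])
            cursor[t][j] += 1

    # Step 8: edges with sources in V[j] are now contiguous, so each thread
    # could run the serial counting sort on its own slice; done globally here.
    indptr = [0] * (N + 1)
    for s, _ in E_sorted:
        indptr[s + 1] += 1
    for i in range(N):
        indptr[i + 1] += indptr[i]
    indices = [0] * M
    slot = indptr[:N]                 # next free slot per CSR row
    for s, d in E_sorted:
        indices[slot[s]] = d
        slot[s] += 1
    return indptr, indices
```

The key memory property is visible in the sketch: the only thread-count-dependent scratch is the P x P table, so the scratch cost is O(M + P^2) rather than the O(N*p) per-thread histograms that appear to drive the peak memory reported above.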

