[Performance] OMP_NUM_THREADS can have a significant impact on memory usage for large graphs (billions of nodes, edges)
🐛 Bug
This isn't necessarily a bug, but more of a tricky observation:
I've been stress-testing DGL's ability to perform random walks on large graphs (e.g., 1B nodes, 1B edges), and I've found that OMP_NUM_THREADS can have a significant impact on memory usage: roughly a 10x difference in peak memory between explicitly setting OMP_NUM_THREADS=1 and leaving it at its default value.
I suspect that the root cause is this COOToCSR() function (which random_walk() calls, since it needs the graph in CSR format):
https://github.com/dmlc/dgl/blob/master/src/array/cpu/spmat_op_impl_coo.cc#L301
Note this important comment:
```
// complexity: time O(NNZ), space O(1) if the coo is row sorted,
// time O(NNZ/p + N), space O(NNZ + N*p) otherwise, where p is the number of
// threads.
```
Here, NNZ is the number of graph edges, and N is the number of graph nodes.
When the graph is large (e.g., billions of nodes), that N*p space usage can be brutal, especially if p (the number of OMP threads) is high.
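To put rough numbers on that (illustrative only, assuming 8-byte int64 indices): with N = 1e9 nodes and, say, p = 24 threads, the O(N*p) scratch space alone comes to about 1e9 × 24 × 8 bytes ≈ 192 GB, the same order of magnitude as the gap shown below.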
As one data point: for my graph with 1B nodes and 100M edges, calling random_walk() resulted in the following peak memory usage:
```
# explicitly set OMP_NUM_THREADS=1
CPU Memory used to store graph (COO+CSR): 17.21 GB
Peak CPU Memory: 20.72 GB

# leave OMP_NUM_THREADS at its default
CPU Memory used to store graph (COO+CSR): 17.21 GB
Peak CPU Memory: 204.31 GB
```
Here, the high "Peak CPU Memory" is (I'm 99% sure) from the memory overhead of doing COOToCSR().
To Reproduce
I don't have a repro script prepared yet, but I can put one together if people are interested.
The methodology is fairly simple: create a random heterograph with two node types and one edge type via dgl.heterograph(data_dict), then call dgl.sampling.random_walk() on the graph (see the sketch below).
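A minimal sketch of that methodology, assuming hypothetical node/edge counts, type names, and walk length (none of these come from the original report; scale the sizes up to approach the numbers above):

```python
import os
# Set before importing torch/dgl so OpenMP picks it up; flip between "1"
# and unset to compare peak memory.
os.environ["OMP_NUM_THREADS"] = "1"

import torch
import dgl

# Hypothetical sizes; scaled down here for a quick local test.
num_users, num_items, num_edges = 10**6, 10**6, 10**7

src = torch.randint(0, num_users, (num_edges,))
dst = torch.randint(0, num_items, (num_edges,))
data_dict = {("user", "clicks", "item"): (src, dst)}
g = dgl.heterograph(data_dict,
                    num_nodes_dict={"user": num_users, "item": num_items})

# The first random_walk call triggers the internal COO -> CSR conversion,
# which is where the extra peak memory shows up.
starts = torch.randint(0, num_users, (1024,))
traces, types = dgl.sampling.random_walk(g, starts, metapath=["clicks"])
```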
Expected behavior
This isn't a failure/bug of the DGL system, but perhaps the documentation could be made clearer about the impact of OMP_NUM_THREADS on memory usage for large graphs, and/or about the importance of sorting the COO rows prior to calling COOToCSR() (the sketch below illustrates the latter).
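A minimal sketch of the row-sorting idea, assuming that presorting the edge list by source node lets the conversion hit the O(1)-space path from the comment quoted above (whether DGL detects the presorted order on this exact construction path is an assumption, not something verified here):

```python
import torch
import dgl

src = torch.randint(0, 1000, (5000,))
dst = torch.randint(0, 1000, (5000,))

# Sort edges by source ("row") so the COO is already row-sorted
# by the time the internal CSR conversion runs.
order = torch.argsort(src)
g = dgl.heterograph({("user", "clicks", "item"): (src[order], dst[order])})
```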
Making scalable systems is always tricky, so maybe this ticket will inspire some ideas. Thanks!
Environment
- DGL Version (e.g., 1.0): A recent nightly build
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
- OS (e.g., Linux): Linux
- How you installed DGL (conda, pip, source): Nightly build
- Build command you used (if compiling from source):
- Python version: 3.7
- CUDA/cuDNN version (if applicable):
- GPU models and configuration (e.g. V100):
- Any other relevant information:
Additional context
Comments
Awesome idea @nv-dlasalle.
In Step 6, I think we should compute the prefix sum of P_sum starting from the top-left, walking down each column and adding the last sum of the previous column into the current column. t*N/P doesn't seem accurate, since the count for each row_idx varies.
In Step 8, each thread t can now construct the CSR rows V[t] for the edges in E_sorted[P_sum[0, t] … P_sum[0, t+1]].
Sorry, in this context M is the number of edges of the graph, so in COO format the memory footprint is 2M, and in CSR it is N+M. For the below algorithm sketch, P will be the number of threads (this should have parallel time complexity O(M + N + P) and memory complexity O(M + N + P^2)):
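The step-by-step sketch itself is not preserved in this excerpt. As a rough, unofficial illustration of the scheme being discussed (a P x P counting table P_sum, the column-wise prefix sum from Step 6, and per-bucket CSR construction from Step 8), here is a sequential NumPy simulation; everything beyond the quoted steps is an assumption, and the loops over t and b stand in for the P threads:

```python
import numpy as np

def coo_to_csr_sketch(row, col, N, P=4):
    """Simulated P-"thread" COO -> CSR via a P x P counting table.

    Illustrative only: the sequential loops over t (edge chunks) and
    b (row buckets) stand in for the parallel threads.
    """
    row = np.asarray(row, dtype=np.int64)
    col = np.asarray(col, dtype=np.int64)
    M = len(row)

    chunk = [t * M // P for t in range(P + 1)]  # edge chunk owned by thread t
    rlo = [b * N // P for b in range(P + 1)]    # row range owned by bucket b
    bucket_of = lambda r: np.searchsorted(rlo, r, side="right") - 1

    # Each thread counts its edges per row bucket; the table itself is the
    # O(P^2) memory term mentioned above.
    P_sum = np.zeros((P, P), dtype=np.int64)    # [thread t, bucket b]
    for t in range(P):
        P_sum[t] = np.bincount(bucket_of(row[chunk[t]:chunk[t + 1]]),
                               minlength=P)

    # "Step 6": column-major exclusive prefix sum: walk down each column,
    # carrying the last sum of the previous column into the current one.
    offsets = np.zeros((P, P), dtype=np.int64)
    carry = 0
    for b in range(P):
        for t in range(P):
            offsets[t, b] = carry
            carry += P_sum[t, b]
    seg = np.append(offsets[0], M)              # bucket boundaries in E_sorted

    # Each thread scatters its edges; E_sorted ends up grouped by row bucket.
    E_row = np.empty(M, dtype=np.int64)
    E_col = np.empty(M, dtype=np.int64)
    for t in range(P):
        cur = offsets[t].copy()
        for i in range(chunk[t], chunk[t + 1]):
            b = bucket_of(row[i])
            E_row[cur[b]], E_col[cur[b]] = row[i], col[i]
            cur[b] += 1

    # "Step 8": each thread builds the CSR rows for its own bucket segment.
    indptr = np.zeros(N + 1, dtype=np.int64)
    indices = np.empty(M, dtype=np.int64)
    for b in range(P):
        s, e = seg[b], seg[b + 1]
        cnt = np.bincount(E_row[s:e] - rlo[b], minlength=rlo[b + 1] - rlo[b])
        indptr[rlo[b] + 1:rlo[b + 1] + 1] = s + np.cumsum(cnt)
        cur = s + np.concatenate(([0], np.cumsum(cnt)[:-1]))
        for i in range(s, e):
            r = E_row[i] - rlo[b]
            indices[cur[r]] = E_col[i]
            cur[r] += 1
    return indptr, indices
```

For example, coo_to_csr_sketch([3, 0, 2, 1], [10, 11, 12, 13], N=4, P=2) returns indptr [0, 1, 2, 3, 4] and indices [11, 13, 12, 10]. The P x P offsets table is the O(P^2) memory term, and each phase does O(M/P + N/P + P) work per thread.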