[Performance] OMP_NUM_THREADS can have a significant impact on memory usage for large graphs (billions of nodes, edges)
🐛 Bug
This isn't necessarily a bug, but more of a tricky observation:
I've been stress-testing DGL's ability to perform random walks on large graphs (e.g., 1B nodes, 1B edges), and I've found that OMP_NUM_THREADS can have a significant impact on memory usage: roughly a 10x difference in peak memory between explicitly setting OMP_NUM_THREADS=1 and leaving it at its default value.
I suspect that the root cause is this COOToCSR() function (which random_walk() calls, since it needs the graph in CSR format):
https://github.com/dmlc/dgl/blob/master/src/array/cpu/spmat_op_impl_coo.cc#L301
Note this important comment:
```
// complexity: time O(NNZ), space O(1) if the coo is row sorted,
// time O(NNZ/p + N), space O(NNZ + N*p) otherwise, where p is the number of
// threads.
```
Here, NNZ is the number of graph edges, and N is the number of graph nodes.
When the graph is large (e.g., billions of nodes), that N*p space usage can be brutal, especially if p (the number of OMP threads) is high.
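To put rough numbers on that (illustrative only, assuming 8-byte int64 indices): with N = 1e9 nodes and, say, p = 24 threads, the O(N*p) scratch space alone comes to about 1e9 × 24 × 8 bytes ≈ 192 GB, the same order of magnitude as the gap shown below.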
As one data point: for my graph with 1B nodes and 100M edges, calling random_walk() resulted in the following peak memory usage:
```
# explicitly set OMP_NUM_THREADS=1
CPU Memory used to store graph (COO+CSR): 17.21 GB
Peak CPU Memory: 20.72 GB

# leave OMP_NUM_THREADS at its default
CPU Memory used to store graph (COO+CSR): 17.21 GB
Peak CPU Memory: 204.31 GB
```
Here, the high "Peak CPU Memory" is (I'm 99% sure) from the memory overhead of doing COOToCSR().
To Reproduce
I don't have a repro script prepared yet, but I can put one together if people are interested.
The methodology is fairly simple: create a random heterograph with two node types and one edge type via dgl.heterograph(data_dict), then call dgl.sampling.random_walk() on the graph (see the sketch below).
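A minimal sketch of that methodology, assuming hypothetical node/edge counts, type names, and walk length (none of these come from the original report; scale the sizes up to approach the numbers above):

```python
import os
# Set before importing torch/dgl so OpenMP picks it up; flip between "1"
# and unset to compare peak memory.
os.environ["OMP_NUM_THREADS"] = "1"

import torch
import dgl

# Hypothetical sizes; scaled down here for a quick local test.
num_users, num_items, num_edges = 10**6, 10**6, 10**7

src = torch.randint(0, num_users, (num_edges,))
dst = torch.randint(0, num_items, (num_edges,))
data_dict = {("user", "clicks", "item"): (src, dst)}
g = dgl.heterograph(data_dict,
                    num_nodes_dict={"user": num_users, "item": num_items})

# The first random_walk call triggers the internal COO -> CSR conversion,
# which is where the extra peak memory shows up.
starts = torch.randint(0, num_users, (1024,))
traces, types = dgl.sampling.random_walk(g, starts, metapath=["clicks"])
```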
Expected behavior
This isn't a failure/bug of the DGL system, but perhaps the documentation could be made clearer about the impact of OMP_NUM_THREADS on memory usage for large graphs, and/or about the importance of sorting the COO rows prior to calling COOToCSR() (the sketch below illustrates the latter).
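A minimal sketch of the row-sorting idea, assuming that presorting the edge list by source node lets the conversion hit the O(1)-space path from the comment quoted above (whether DGL detects the presorted order on this exact construction path is an assumption, not something verified here):

```python
import torch
import dgl

src = torch.randint(0, 1000, (5000,))
dst = torch.randint(0, 1000, (5000,))

# Sort edges by source ("row") so the COO is already row-sorted
# by the time the internal CSR conversion runs.
order = torch.argsort(src)
g = dgl.heterograph({("user", "clicks", "item"): (src[order], dst[order])})
```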
Making scalable systems is always tricky, so maybe this ticket will inspire some ideas. Thanks!
Environment
- DGL Version (e.g., 1.0): A recent nightly build
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
- OS (e.g., Linux): Linux
- How you installed DGL (conda, pip, source): Nightly build
- Build command you used (if compiling from source):
- Python version: 3.7
- CUDA/cuDNN version (if applicable):
- GPU models and configuration (e.g. V100):
- Any other relevant information:
Additional context
Comments
Awesome idea @nv-dlasalle.
In Step 6, I think we should compute the prefix sum of P_sum starting from the top-left, walking down each column and adding the last sum of the previous column into the current column. t*N/P doesn't seem accurate, since the count for each row_idx varies.
In Step 8, each thread t can now construct the CSR rows V[t] for the edges in E_sorted[P_sum[0, t] … P_sum[0, t+1]].
Sorry, in this context M is the number of edges of the graph, so in COO format the memory footprint is 2M, and in CSR it is N+M. For the below algorithm sketch, P will be the number of threads (this should have parallel time complexity O(M + N + P) and memory complexity O(M + N + P^2)):
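The step-by-step sketch itself is not preserved in this excerpt. As a rough, unofficial illustration of the scheme being discussed (a P x P counting table P_sum, the column-wise prefix sum from Step 6, and per-bucket CSR construction from Step 8), here is a sequential NumPy simulation; everything beyond the quoted steps is an assumption, and the loops over t and b stand in for the P threads:

```python
import numpy as np

def coo_to_csr_sketch(row, col, N, P=4):
    """Simulated P-"thread" COO -> CSR via a P x P counting table.

    Illustrative only: the sequential loops over t (edge chunks) and
    b (row buckets) stand in for the parallel threads.
    """
    row = np.asarray(row, dtype=np.int64)
    col = np.asarray(col, dtype=np.int64)
    M = len(row)

    chunk = [t * M // P for t in range(P + 1)]  # edge chunk owned by thread t
    rlo = [b * N // P for b in range(P + 1)]    # row range owned by bucket b
    bucket_of = lambda r: np.searchsorted(rlo, r, side="right") - 1

    # Each thread counts its edges per row bucket; the table itself is the
    # O(P^2) memory term mentioned above.
    P_sum = np.zeros((P, P), dtype=np.int64)    # [thread t, bucket b]
    for t in range(P):
        P_sum[t] = np.bincount(bucket_of(row[chunk[t]:chunk[t + 1]]),
                               minlength=P)

    # "Step 6": column-major exclusive prefix sum: walk down each column,
    # carrying the last sum of the previous column into the current one.
    offsets = np.zeros((P, P), dtype=np.int64)
    carry = 0
    for b in range(P):
        for t in range(P):
            offsets[t, b] = carry
            carry += P_sum[t, b]
    seg = np.append(offsets[0], M)              # bucket boundaries in E_sorted

    # Each thread scatters its edges; E_sorted ends up grouped by row bucket.
    E_row = np.empty(M, dtype=np.int64)
    E_col = np.empty(M, dtype=np.int64)
    for t in range(P):
        cur = offsets[t].copy()
        for i in range(chunk[t], chunk[t + 1]):
            b = bucket_of(row[i])
            E_row[cur[b]], E_col[cur[b]] = row[i], col[i]
            cur[b] += 1

    # "Step 8": each thread builds the CSR rows for its own bucket segment.
    indptr = np.zeros(N + 1, dtype=np.int64)
    indices = np.empty(M, dtype=np.int64)
    for b in range(P):
        s, e = seg[b], seg[b + 1]
        cnt = np.bincount(E_row[s:e] - rlo[b], minlength=rlo[b + 1] - rlo[b])
        indptr[rlo[b] + 1:rlo[b + 1] + 1] = s + np.cumsum(cnt)
        cur = s + np.concatenate(([0], np.cumsum(cnt)[:-1]))
        for i in range(s, e):
            r = E_row[i] - rlo[b]
            indices[cur[r]] = E_col[i]
            cur[r] += 1
    return indptr, indices
```

For example, coo_to_csr_sketch([3, 0, 2, 1], [10, 11, 12, 13], N=4, P=2) returns indptr [0, 1, 2, 3, 4] and indices [11, 13, 12, 10]. The P x P offsets table is the O(P^2) memory term, and each phase does O(M/P + N/P + P) work per thread.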