[Bug] CUDA stream synchronization issue between pytorch and DGL internal functions
CUDA stream support tracker:
- Make TensorAdaptor stream-aware. #4470 #4472
- Clean up improper CUDA stream usage in the core library, including CUB calls and many hard-coded uses of the default stream (0 or nullptr). #4471 #4480
- Add stream-related utilities to DGL, e.g., `.record_stream` for `dgl.graph` #4467, and always use PyTorch's CUDA stream when it is available #4503 (see the sketch below).
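As a rough illustration of the last item: from the Python side, the stream PyTorch is currently using can be queried and handed to a backend, which is what "always use PyTorch's CUDA stream" refers to. A minimal sketch in plain PyTorch (the `hex` printout is just for inspection):

```python
import torch

# Minimal sketch: querying the stream PyTorch is currently using.
# A library launching its own kernels (as DGL's core does) would need to
# launch on this handle instead of the hard-coded default stream (0).
s = torch.cuda.Stream()
torch.cuda.set_stream(s)

handle = torch.cuda.current_stream().cuda_stream  # raw cudaStream_t as an int
print(hex(handle))  # non-zero here, since `s` is a non-default stream
```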
🐛 Bug
Update (08/19/2022):
When a non-default stream is set for all torch operations (`torch.cuda.set_stream()`), DGL lacks the synchronization needed in its core library (which always uses the default stream) whenever there is a data dependency between the two streams.
Under certain conditions, namely (1) using a non-default stream and (2) running under `torch.no_grad()`, a device-side assertion failure occurs when calling the `SpMM()` function.
With `CUDA_LAUNCH_BLOCKING=1` enabled, the assertion failure disappears.
I think this is caused by using different CUDA streams for data transfers and DGL internal functions without stream synchronization. Please see the Investigation section below.
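To make the failure mode concrete, here is a minimal sketch of the same class of race in plain PyTorch, with no DGL involved; the `wait_stream` call is the ordering that DGL's default-stream kernels are currently missing:

```python
import torch

side = torch.cuda.Stream()  # stands in for the stream set via set_stream()

with torch.cuda.stream(side):
    # Producer work on the non-default stream, analogous to the PyTorch
    # ops (e.g. index_select) feeding data into DGL on the user's stream.
    x = torch.randn(1 << 22, device="cuda")
    idx = torch.randint(x.numel(), (1 << 22,), device="cuda")
    y = torch.index_select(x, 0, idx)

# DGL's core kernels launch on the default stream. Without this wait, a
# default-stream consumer could read `y` while index_select is still
# writing it -- the race described above. (For buffers that outlive this
# scope, y.record_stream(...) would also be needed.)
torch.cuda.default_stream().wait_stream(side)
out = y.sum()  # stand-in for a default-stream consumer such as SpMM
```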
To Reproduce
Run the following Python script:
```python
import torch
import dgl
import torch.nn as nn
from dgl.nn.pytorch import RelGraphConv

class GraphConv(nn.Module):
    def __init__(self, in_channels, out_channels, num_edge_type=3):
        super(GraphConv, self).__init__()
        self.conv = RelGraphConv(in_channels, out_channels, num_edge_type)
        self.in_ch = in_channels
        self.out_ch = out_channels

    def forward(self, dgl_graph, e_type, input):
        res = self.conv(dgl_graph, input, e_type)
        return res

device = 0

# use a non-default stream
s = torch.cuda.Stream()
torch.cuda.set_stream(s)

# input_dim: 10, out_dim: 256
model = GraphConv(10, 256).to(device)

# torch.no_grad() is necessary to reproduce the issue
with torch.no_grad():
    # graph: 10240 nodes, 409600 edges
    # generate src/dst node ids for 409600 edges
    src = torch.randint(10240, (409600,))
    dst = torch.randint(10240, (409600,))
    # generate dgl graph and find # of nodes
    dgl_graph = dgl.graph(data=(src, dst)).to(device)
    n_node = dgl_graph.num_nodes()
    # edge type ids: 0, 1, 2 for all 409600 edges
    e_type = torch.randint(3, (409600,)).to(device)
    # run multiple forward passes to reveal the issue (critical!!!)
    node_feat = torch.rand(n_node, 10).to(device)
    for i in range(10):
        _ = model(dgl_graph, e_type, node_feat)
```
Sample error message:

```
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [373,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [373,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "entity.py", line 41, in <module>
    h = model(dgl_graph, e_type, node_feat)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "entity.py", line 14, in forward
    res = self.conv(dgl_graph, input, e_type)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 167, in forward
    g.update_all(self.message, fn.sum('m', 'h'))
  File "/opt/conda/lib/python3.8/site-packages/dgl/heterograph.py", line 4895, in update_all
    ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
  File "/opt/conda/lib/python3.8/site-packages/dgl/core.py", line 365, in message_passing
    msgdata = invoke_edge_udf(g, ALL, g.canonical_etypes[0], mfunc, orig_eid=orig_eid)
  File "/opt/conda/lib/python3.8/site-packages/dgl/core.py", line 85, in invoke_edge_udf
    return func(ebatch)
  File "/opt/conda/lib/python3.8/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 131, in message
    m = self.linear_r(edges.src['h'], edges.data['etype'], self.presorted)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dgl/nn/pytorch/linear.py", line 174, in forward
    return gather_mm(x, w, idx_b=x_type)
  File "/opt/conda/lib/python3.8/site-packages/dgl/ops/gather_mm.py", line 38, in gather_mm
    pos_r = torch.cat([pos_l[1:], torch.tensor([len(idx_b)], device=a.device)])
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
Environment
- DGL Version (e.g., 1.0): 0.9
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.13
- How you installed DGL (conda, pip, source): source
- CUDA/cuDNN version (if applicable): 11.6
- GPU models and configuration (e.g. V100): A5000
Investigation
Several observations:

I think DGL's core library doesn't fully support CUDA streams yet. For now, I suggest limiting the use of non-default streams to data loading, and carefully waiting on/synchronizing with the main stream; a sketch follows below. @isratnisa @nv-dlasalle @BarclayII @jermainewang Please comment.
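For concreteness, a hedged sketch of that mitigation: restrict the side stream to host-to-device copies and synchronize explicitly before the main (default) stream, where DGL's kernels run, consumes the result. `fetch_to_gpu` is a hypothetical helper, not a DGL API:

```python
import torch

loader_stream = torch.cuda.Stream()  # side stream used only for copies

def fetch_to_gpu(cpu_tensor):
    """Hypothetical helper: copy on the side stream, then hand the result
    to the main (default) stream safely."""
    pinned = cpu_tensor.pin_memory()
    with torch.cuda.stream(loader_stream):
        gpu_tensor = pinned.to("cuda", non_blocking=True)
    # (1) ordering: the main stream must wait for the copy to finish
    torch.cuda.current_stream().wait_stream(loader_stream)
    # (2) lifetime: tell the caching allocator the buffer is now in use on
    # the main stream, so it is not recycled too early
    gpu_tensor.record_stream(torch.cuda.current_stream())
    return gpu_tensor
```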
Regarding the above piece of code: although it doesn't throw an error when `torch.no_grad()` is removed, the result could still be incorrect, because the compute kernel `SpMMCsrKernel` is executed before the `indexSelect` kernel has finished.

@chang-l Thanks for your investigation, nice catch! The memory footprint increased because calling `OPS.copy_u_sum(g, x)` materializes other sparse formats. So a simple solution is to call `.create_formats()` for `g` and `g3`; a sketch of this workaround follows below. I will send a PR to fix it.
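For reference, a minimal sketch of that workaround, assuming the public in-place method `create_formats_()` available on recent `DGLGraph` versions:

```python
import dgl
import torch

src = torch.randint(10240, (409600,))
dst = torch.randint(10240, (409600,))
g = dgl.graph((src, dst)).to(0)

# Materialize all sparse formats (coo/csr/csc) up front, so ops like
# copy_u_sum don't trigger a lazy format conversion (and the extra
# memory allocation) in the middle of training.
g.create_formats_()
```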