[Bug] CUDA stream synchronization issue between pytorch and DGL internal functions
CUDA stream support tracker:
- Make TensorAdaptor stream-aware. #4470 #4472
- Clean up improper CUDA stream usage in the core library, including CUB calls and many hard-coded uses of the default stream (0 or nullptr). #4471 #4480
- Add stream-related utilities to DGL, e.g., `.record_stream` for `dgl.graph` #4467, and always use PyTorch's CUDA stream when it is available #4503 (see the sketch below).
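As a rough illustration of the last item: from the Python side, the stream PyTorch is currently using can be queried and handed to a backend, which is what "always use PyTorch's CUDA stream" refers to. A minimal sketch in plain PyTorch (the `hex` printout is just for inspection):

```python
import torch

# Minimal sketch: querying the stream PyTorch is currently using.
# A library launching its own kernels (as DGL's core does) would need to
# launch on this handle instead of the hard-coded default stream (0).
s = torch.cuda.Stream()
torch.cuda.set_stream(s)

handle = torch.cuda.current_stream().cuda_stream  # raw cudaStream_t as an int
print(hex(handle))  # non-zero here, since `s` is a non-default stream
```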
🐛 Bug
Update (08/19/2022):
When a non-default stream is set for all torch operations (`torch.cuda.set_stream()`), DGL lacks the synchronization needed in its core library (which always uses the default stream) whenever there is a data dependency between the two streams.
Under certain conditions, namely (1) using a non-default stream and (2) running under `torch.no_grad()`, a device-side assertion failure occurs when calling the `SpMM()` function.
With `CUDA_LAUNCH_BLOCKING=1` enabled, the assertion failure disappears.
I think this is caused by using different CUDA streams for data transfers and DGL internal functions without stream synchronization. Please see the Investigation section below.
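To make the failure mode concrete, here is a minimal sketch of the same class of race in plain PyTorch, with no DGL involved; the `wait_stream` call is the ordering that DGL's default-stream kernels are currently missing:

```python
import torch

side = torch.cuda.Stream()  # stands in for the stream set via set_stream()

with torch.cuda.stream(side):
    # Producer work on the non-default stream, analogous to the PyTorch
    # ops (e.g. index_select) feeding data into DGL on the user's stream.
    x = torch.randn(1 << 22, device="cuda")
    idx = torch.randint(x.numel(), (1 << 22,), device="cuda")
    y = torch.index_select(x, 0, idx)

# DGL's core kernels launch on the default stream. Without this wait, a
# default-stream consumer could read `y` while index_select is still
# writing it -- the race described above. (For buffers that outlive this
# scope, y.record_stream(...) would also be needed.)
torch.cuda.default_stream().wait_stream(side)
out = y.sum()  # stand-in for a default-stream consumer such as SpMM
```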
To Reproduce
Run the following Python script:
```python
import torch
import dgl
import torch.nn as nn
from dgl.nn.pytorch import RelGraphConv

class GraphConv(nn.Module):
    def __init__(self, in_channels, out_channels, num_edge_type=3):
        super(GraphConv, self).__init__()
        self.conv = RelGraphConv(in_channels, out_channels, num_edge_type)
        self.in_ch = in_channels
        self.out_ch = out_channels

    def forward(self, dgl_graph, e_type, input):
        res = self.conv(dgl_graph, input, e_type)
        return res

device = 0

# use a non-default stream
s = torch.cuda.Stream()
torch.cuda.set_stream(s)

# input_dim: 10, out_dim: 256
model = GraphConv(10, 256).to(device)

# torch.no_grad() is necessary to reproduce the issue
with torch.no_grad():
    # graph: 10240 nodes, 409600 edges
    # generate src/dst node ids for 409600 edges
    src = torch.randint(10240, (409600,))
    dst = torch.randint(10240, (409600,))
    # generate dgl graph and find # of nodes
    dgl_graph = dgl.graph(data=(src, dst)).to(device)
    n_node = dgl_graph.num_nodes()
    # edge type ids: 0, 1, 2 for all 409600 edges
    e_type = torch.randint(3, (409600,)).to(device)
    # run multiple forward passes to reveal the issue (critical!!!)
    node_feat = torch.rand(n_node, 10).to(device)
    for i in range(10):
        _ = model(dgl_graph, e_type, node_feat)
```
Sample error message:

```
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [373,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [373,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "entity.py", line 41, in <module>
    h = model(dgl_graph, e_type, node_feat)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "entity.py", line 14, in forward
    res = self.conv(dgl_graph, input, e_type)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 167, in forward
    g.update_all(self.message, fn.sum('m', 'h'))
  File "/opt/conda/lib/python3.8/site-packages/dgl/heterograph.py", line 4895, in update_all
    ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
  File "/opt/conda/lib/python3.8/site-packages/dgl/core.py", line 365, in message_passing
    msgdata = invoke_edge_udf(g, ALL, g.canonical_etypes[0], mfunc, orig_eid=orig_eid)
  File "/opt/conda/lib/python3.8/site-packages/dgl/core.py", line 85, in invoke_edge_udf
    return func(ebatch)
  File "/opt/conda/lib/python3.8/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 131, in message
    m = self.linear_r(edges.src['h'], edges.data['etype'], self.presorted)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dgl/nn/pytorch/linear.py", line 174, in forward
    return gather_mm(x, w, idx_b=x_type)
  File "/opt/conda/lib/python3.8/site-packages/dgl/ops/gather_mm.py", line 38, in gather_mm
    pos_r = torch.cat([pos_l[1:], torch.tensor([len(idx_b)], device=a.device)])
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
Environment
- DGL Version (e.g., 1.0): 0.9
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.13
- How you installed DGL (conda, pip, source): source
- CUDA/cuDNN version (if applicable): 11.6
- GPU models and configuration (e.g. V100): A5000
Investigation
Several observations:

I think DGL's core library doesn't fully support CUDA streams yet. For now, I suggest limiting the use of non-default streams to data loading, and carefully waiting on/synchronizing with the main stream; a sketch follows below. @isratnisa @nv-dlasalle @BarclayII @jermainewang Please comment.
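For concreteness, a hedged sketch of that mitigation: restrict the side stream to host-to-device copies and synchronize explicitly before the main (default) stream, where DGL's kernels run, consumes the result. `fetch_to_gpu` is a hypothetical helper, not a DGL API:

```python
import torch

loader_stream = torch.cuda.Stream()  # side stream used only for copies

def fetch_to_gpu(cpu_tensor):
    """Hypothetical helper: copy on the side stream, then hand the result
    to the main (default) stream safely."""
    pinned = cpu_tensor.pin_memory()
    with torch.cuda.stream(loader_stream):
        gpu_tensor = pinned.to("cuda", non_blocking=True)
    # (1) ordering: the main stream must wait for the copy to finish
    torch.cuda.current_stream().wait_stream(loader_stream)
    # (2) lifetime: tell the caching allocator the buffer is now in use on
    # the main stream, so it is not recycled too early
    gpu_tensor.record_stream(torch.cuda.current_stream())
    return gpu_tensor
```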
Regarding the above piece of code: although it doesn't throw an error when `torch.no_grad()` is removed, the result could still be incorrect, because the compute kernel `SpMMCsrKernel` is executed before the `indexSelect` kernel has finished.

@chang-l Thanks for your investigation, nice catch! The memory footprint increased because calling `OPS.copy_u_sum(g, x)` materializes other sparse formats. So a simple solution is to call `.create_formats()` for `g` and `g3`; a sketch of this workaround follows below. I will send a PR to fix it.
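For reference, a minimal sketch of that workaround, assuming the public in-place method `create_formats_()` available on recent `DGLGraph` versions:

```python
import dgl
import torch

src = torch.randint(10240, (409600,))
dst = torch.randint(10240, (409600,))
g = dgl.graph((src, dst)).to(0)

# Materialize all sparse formats (coo/csr/csc) up front, so ops like
# copy_u_sum don't trigger a lazy format conversion (and the extra
# memory allocation) in the middle of training.
g.create_formats_()
```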