spspmm raises an error on CUDA but works well on CPU

See original GitHub issue

🐛 Bug

To Reproduce

The network is similar to Graph U-Net, but has only downsampling blocks. The code is:


import torch
import torch.nn as nn
import torch_geometric.nn as gnn
from torch_geometric.nn import GCNConv, TopKPooling
from torch_geometric.utils import add_self_loops, sort_edge_index, remove_self_loops
from torch_sparse import spspmm



class GCNConvBnReLu(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = GCNConv(in_channels, out_channels, bias=False, improved=True)
        self.bn = gnn.BatchNorm(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x, edge_index, edge_weight=None):
        x = self.conv(x, edge_index, edge_weight)
        x = self.bn(x)
        x = self.relu(x)
        return x


class MyNet(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, pool_ratios=0.5, depth=3):
        super().__init__()

        channels = in_channels
        self.depth = depth
        self.down_convs = nn.ModuleList()
        self.pools = nn.ModuleList()
        self.down_convs.append(GCNConvBnReLu(channels, hidden_channels))
        self.fc = nn.Linear(hidden_channels, out_channels)

        for i in range(depth):
            self.down_convs.append(GCNConvBnReLu(hidden_channels, hidden_channels))
            self.pools.append(TopKPooling(hidden_channels, ratio=pool_ratios))


    def forward(self, x, edge_index, batch=None):
        depth = self.depth
        edge_weight = x.new_ones(edge_index.size(1))
        x = self.down_convs[0](x, edge_index, edge_weight)

        for i in range(1, depth + 1):
            print(edge_index.shape)
            print(edge_index.min(), edge_index.max(), x.size(0))

            edge_index, edge_weight = self.augment_adj(edge_index, edge_weight, x.size(0))

            print(edge_index.shape)
            print(edge_index.min(), edge_index.max(), x.size(0))
            print('----------')

            x, edge_index, edge_weight, batch, _, _ = self.pools[i-1](x, edge_index, edge_weight, batch)
            x = self.down_convs[i](x, edge_index, edge_weight)

        x = gnn.global_mean_pool(x, batch)
        out = self.fc(x)
        return out

    def augment_adj(self, edge_index, edge_weight, num_nodes):
        edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
        edge_index, edge_weight = add_self_loops(edge_index, edge_weight,
                                                 num_nodes=num_nodes)
        edge_index, edge_weight = sort_edge_index(edge_index, edge_weight,
                                                  num_nodes)
        edge_index, edge_weight = spspmm(edge_index, edge_weight, edge_index,
                                         edge_weight, num_nodes, num_nodes,
                                         num_nodes)
        edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
        return edge_index, edge_weight

Test MyNet as follows. The test data can be downloaded from Google Drive.


device = torch.device('cuda')
model = MyNet(3, 64, 4, 0.5, 3).to(device)
data1 = torch.load('success.pt').to(device)
y1 = model(data1.x, data1.edge_index)
data2 = torch.load('failed.pt').to(device)
y2 = model(data2.x, data2.edge_index)

Expected behavior

The error log is

  File "D:\Software\anaconda3\lib\site-packages\torch_sparse\spspmm.py", line 30, in spspmm
    C = matmul(A, B)
  File "D:\Software\anaconda3\lib\site-packages\torch_sparse\matmul.py", line 125, in matmul
    return spspmm(src, other, reduce)
  File "D:\Software\anaconda3\lib\site-packages\torch_sparse\matmul.py", line 102, in spspmm
    return spspmm_sum(src, other)
  File "D:\Software\anaconda3\lib\site-packages\torch_sparse\matmul.py", line 92, in spspmm_sum
    sparse_sizes=(M, K), is_sorted=True)
  File "D:\Software\anaconda3\lib\site-packages\torch_sparse\tensor.py", line 25, in __init__
    is_sorted=is_sorted)
  File "D:\Software\anaconda3\lib\site-packages\torch_sparse\storage.py", line 70, in __init__
    assert col.max().item() < sparse_sizes[1]
AssertionError

When I set model.eval() or device='cpu', the code works well.
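
The assertion at the bottom of the traceback checks that every column index of the product is smaller than the number of columns. To narrow down where an out-of-range index first appears, it can help to compare the GPU result on a small graph against a plain CPU reference. Below is a minimal pure-Python sketch of what augment_adj computes: drop self loops, add self loops with weight 1, square the matrix, then drop self loops again. The helper name `augment_adj_reference` and the dict-based edge format are assumptions made for illustration; this is not part of torch_sparse or PyG.

```python
def augment_adj_reference(edges, num_nodes):
    """Hypothetical CPU reference for augment_adj on small graphs.

    `edges` maps (row, col) -> weight. Returns the off-diagonal entries
    of (A + I) @ (A + I) in the same format.
    """
    # Same kind of bounds check as the assertion that fails in the traceback.
    for (row, col) in edges:
        if not (0 <= row < num_nodes and 0 <= col < num_nodes):
            raise ValueError(
                f"edge ({row}, {col}) is out of range for {num_nodes} nodes")

    # remove_self_loops, then add_self_loops with weight 1 on the diagonal.
    a = {(r, c): w for (r, c), w in edges.items() if r != c}
    for i in range(num_nodes):
        a[(i, i)] = 1.0

    # Group entries by row so the product only visits nonzero entries.
    rows = {}
    for (r, c), w in a.items():
        rows.setdefault(r, []).append((c, w))

    # Sparse matrix product: C[r, c] = sum over k of A[r, k] * A[k, c].
    product = {}
    for r, entries in rows.items():
        for k, w_rk in entries:
            for c, w_kc in rows.get(k, []):
                product[(r, c)] = product.get((r, c), 0.0) + w_rk * w_kc

    # Final remove_self_loops.
    return {(r, c): w for (r, c), w in product.items() if r != c}


# Path graph 0 - 1 - 2 with both edge directions stored.
path = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 1.0, (2, 1): 1.0}
two_hop = augment_adj_reference(path, num_nodes=3)
```

For the path graph, nodes 0 and 2 become connected after squaring, which is the two-hop augmentation the model relies on before pooling. If the CUDA output for the same small input contains an index outside `range(num_nodes)`, the problem lies in the GPU routine rather than in the input data.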

Environment

  • OS: Win10
  • Python version: 3.7
  • PyTorch version: 1.8.1
  • PyG version: 1.7.2
  • CUDA/cuDNN version: 10.2 / 8.0.5
  • GCC version:
  • Any other relevant information:

Additional context

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 13 (6 by maintainers)

Top GitHub Comments

1 reaction
rusty1s commented, Feb 14, 2022

@wrccrwx @KimKyuSik It’s really a bummer that I cannot reproduce this issue. I’m really sorry. I basically followed the instructions from the cuSPARSE documentation when implementing our CUDA routine in spspmm_cuda.cu:

// assume matrices A, B and D are ready.
int baseC, nnzC;
csrgemm2Info_t info = NULL;
size_t bufferSize;
void *buffer = NULL;
// nnzTotalDevHostPtr points to host memory
int *nnzTotalDevHostPtr = &nnzC;
double alpha = -1.0;
double beta  =  1.0;
cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST);

// step 1: create an opaque structure
cusparseCreateCsrgemm2Info(&info);

// step 2: allocate buffer for csrgemm2Nnz and csrgemm2
cusparseDcsrgemm2_bufferSizeExt(handle, m, n, k, &alpha,
    descrA, nnzA, csrRowPtrA, csrColIndA,
    descrB, nnzB, csrRowPtrB, csrColIndB,
    &beta,
    descrD, nnzD, csrRowPtrD, csrColIndD,
    info,
    &bufferSize);
cudaMalloc(&buffer, bufferSize);

// step 3: compute csrRowPtrC
cudaMalloc((void**)&csrRowPtrC, sizeof(int)*(m+1));
cusparseXcsrgemm2Nnz(handle, m, n, k,
        descrA, nnzA, csrRowPtrA, csrColIndA,
        descrB, nnzB, csrRowPtrB, csrColIndB,
        descrD, nnzD, csrRowPtrD, csrColIndD,
        descrC, csrRowPtrC, nnzTotalDevHostPtr,
        info, buffer );
if (NULL != nnzTotalDevHostPtr){
    nnzC = *nnzTotalDevHostPtr;
}else{
    cudaMemcpy(&nnzC, csrRowPtrC+m, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&baseC, csrRowPtrC, sizeof(int), cudaMemcpyDeviceToHost);
    nnzC -= baseC;
}

// step 4: finish sparsity pattern and value of C
cudaMalloc((void**)&csrColIndC, sizeof(int)*nnzC);
cudaMalloc((void**)&csrValC, sizeof(double)*nnzC);
// Remark: set csrValC to null if only sparsity pattern is required.
cusparseDcsrgemm2(handle, m, n, k, &alpha,
        descrA, nnzA, csrValA, csrRowPtrA, csrColIndA,
        descrB, nnzB, csrValB, csrRowPtrB, csrColIndB,
        &beta,
        descrD, nnzD, csrValD, csrRowPtrD, csrColIndD,
        descrC, csrValC, csrRowPtrC, csrColIndC,
        info, buffer);

// step 5: destroy the opaque structure
cusparseDestroyCsrgemm2Info(info);

Any chance you can debug where our routine crashes by installing torch-sparse from source?
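
The cuSPARSE listing above works in two passes: step 3 computes the row pointers of C and the total number of nonzeros, and step 4 fills the column indices and values into buffers of exactly that size. A mismatch between the two passes is one place where out-of-range indices could come from. The following is a minimal pure-Python sketch of that two-pass pattern for C = A * B with zero-based CSR inputs. It leaves out alpha, beta and the matrix D, the function name `csr_matmul_two_pass` is hypothetical, and it is not the torch-sparse implementation.

```python
def csr_matmul_two_pass(m, n, a_rowptr, a_col, a_val, b_rowptr, b_col, b_val):
    """Sketch of a two-pass CSR product C = A @ B, where C has shape (m, n)."""
    # Pass 1 (symbolic): count the distinct columns in each row of C
    # to build the row pointer array, analogous to csrRowPtrC.
    c_rowptr = [0] * (m + 1)
    for i in range(m):
        cols = set()
        for p in range(a_rowptr[i], a_rowptr[i + 1]):
            k = a_col[p]
            for q in range(b_rowptr[k], b_rowptr[k + 1]):
                cols.add(b_col[q])
        c_rowptr[i + 1] = c_rowptr[i] + len(cols)

    # The last row pointer is the total number of nonzeros in C.
    nnz_c = c_rowptr[m]

    # Pass 2 (numeric): allocate exactly nnz_c slots, then fill them.
    c_col = [0] * nnz_c
    c_val = [0.0] * nnz_c
    for i in range(m):
        acc = {}
        for p in range(a_rowptr[i], a_rowptr[i + 1]):
            k = a_col[p]
            for q in range(b_rowptr[k], b_rowptr[k + 1]):
                j = b_col[q]
                acc[j] = acc.get(j, 0.0) + a_val[p] * b_val[q]
        start = c_rowptr[i]
        for offset, j in enumerate(sorted(acc)):
            c_col[start + offset] = j
            c_val[start + offset] = acc[j]

    # Every column index must be a valid column of C.
    assert all(0 <= j < n for j in c_col)
    return c_rowptr, c_col, c_val


# A = [[1, 1, 0], [1, 1, 1], [0, 1, 1]] in CSR form, multiplied by itself.
rowptr = [0, 2, 5, 7]
col = [0, 1, 0, 1, 2, 1, 2]
val = [1.0] * 7
c_rowptr, c_col, c_val = csr_matmul_two_pass(3, 3, rowptr, col, val,
                                             rowptr, col, val)
```

Running the same small input through the CUDA path and comparing row pointers, column indices and values against this reference would show whether the symbolic or the numeric pass diverges.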

0 reactions
KimKyuSik commented, Feb 13, 2022

I have the exact same issue when using a Titan RTX and an RTX 3090. Is there any way to solve it?
