
When using UVA, CUDA errors occur at tensor.to(device)

See original GitHub issue

🐛 Bug

When trying to perform inference using UVA, several different errors can occur. I ran the code several times, and each time I got a different result.

To Reproduce

import gc

import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

from dgl.data import RedditDataset
from dgl.nn import GATConv

class GAT(nn.Module):
    def __init__(self,
                 num_layers,
                 in_dim,
                 num_hidden,
                 num_classes,
                 heads,
                 activation,
                 feat_drop,
                 attn_drop,
                 negative_slope,
                 residual):
        super(GAT, self).__init__()
        self.num_layers = num_layers
        self.gat_layers = nn.ModuleList()
        self.activation = activation
        self.hidden_features = num_hidden
        self.heads = heads
        self.out_features = num_classes
        # input projection (no residual)
        self.gat_layers.append(GATConv(
            in_dim, num_hidden, heads[0],
            feat_drop, attn_drop, negative_slope, False, self.activation, allow_zero_in_degree=True))
        # hidden layers
        for l in range(1, num_layers - 1):
            # due to multi-head, the in_dim = num_hidden * num_heads
            self.gat_layers.append(GATConv(
                num_hidden * heads[l-1], num_hidden, heads[l],
                feat_drop, attn_drop, negative_slope, residual, self.activation, allow_zero_in_degree=True))
        # output projection
        self.gat_layers.append(GATConv(
            num_hidden * heads[-2], num_classes, heads[-1],
            feat_drop, attn_drop, negative_slope, residual, None, allow_zero_in_degree=True))

    def forward(self, g, inputs):
        h = inputs
        for l in range(self.num_layers - 1):
            h = self.gat_layers[l](g[l], h).flatten(1)
        # output projection
        logits = self.gat_layers[-1](g[-1], h).mean(1)
        return logits

    def forward_full(self, g, inputs):
        h = inputs
        for l in range(self.num_layers - 1):
            h = self.gat_layers[l](g, h).flatten(1)
        # output projection
        logits = self.gat_layers[-1](g, h).mean(1)
        return logits

    def inference(self, g, batch_size, device, x):
        # layer-wise full-graph inference: compute each layer's output for
        # all nodes (via a UVA dataloader) before moving on to the next layer
        torch.cuda.reset_peak_memory_stats()
        for l, layer in enumerate(self.gat_layers):
            gc.collect()
            torch.cuda.empty_cache()
            if l != self.num_layers - 1:
                y = torch.zeros(g.number_of_nodes(), self.heads[l] * self.hidden_features)
            else:
                y = torch.zeros(g.number_of_nodes(), self.out_features)
            g.ndata['feat'] = x
            sampler = dgl.dataloading.MultiLayerFullNeighborSampler(1, prefetch_node_feats=['feat'])
            dataloader = dgl.dataloading.NodeDataLoader(
                g, torch.arange(g.number_of_nodes()).to(device), sampler,
                batch_size=batch_size,
                shuffle=False,
                drop_last=False,
                use_uva=True,
                device=device,
                num_workers=0)
            for input_nodes, output_nodes, blocks in dataloader:
                torch.cuda.reset_peak_memory_stats()
                torch.cuda.empty_cache()
                block = blocks[0].to(device)
                h = block.srcdata['feat']
                h = h.to(device)

                h = layer(block, h)
                if l == self.num_layers - 1:
                    logits = h.mean(1)
                    y[output_nodes] = logits.cpu()
                else:
                    h = h.flatten(1)
                    y[output_nodes] = h.cpu()
            # feed this layer's output in as the next layer's input features
            x = y
        return y


def load_reddit():
    data = RedditDataset(self_loop=True)
    g = data[0]
    g.ndata['features'] = g.ndata['feat']
    return g, data.num_classes

if __name__ == '__main__':
    dataset = load_reddit()
    g: dgl.DGLHeteroGraph = dataset[0]
    train_mask = g.ndata['train_mask']
    val_mask = g.ndata['val_mask']
    test_mask = g.ndata['test_mask']
    feat = g.ndata['feat']
    labels = g.ndata['label']
    num_classes = dataset[1]
    in_feats = feat.shape[1]
    train_nid = torch.nonzero(train_mask, as_tuple=True)[0]
    hidden_feature = 128

    sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25, 50])
    dataloader = dgl.dataloading.NodeDataLoader(
        g, train_nid, sampler,
        batch_size=2000,
        shuffle=True,
        drop_last=False,
        num_workers=4)

    model = GAT(3, in_feats, hidden_feature, num_classes, [2, 2, 2], F.relu, 0.5, 0.5, 0.5, 0.5)

    device = "cuda:0"
    model = model.to(torch.device(device))
    opt = torch.optim.Adam(model.parameters())
    loss_fcn = nn.CrossEntropyLoss()

    for epoch in range(1):
        for input_nodes, output_nodes, blocks in dataloader:
            blocks = [b.to(torch.device(device)) for b in blocks]
            input_features = feat[input_nodes].to(torch.device(device))
            pred = model(blocks, input_features)
            output_labels = labels[output_nodes].to(torch.device(device))
            loss = loss_fcn(pred, output_labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
            break

    with torch.no_grad():
        pred = model.inference(g, 10000, torch.device(device), feat)
        func_score = (torch.argmax(pred, dim=1) == labels).float().sum() / len(pred)

Steps to reproduce the behavior:

Sometimes I can run the code successfully, but sometimes an error occurs. This is the most common error I encountered; it happens at h.cpu() or h.to(device).

Traceback (most recent call last):
  File "/home/ec2-user/inference_helper/bug.py", line 146, in <module>
    pred = model.inference(g, 10000, torch.device(device), feat)
  File "/home/ec2-user/inference_helper/bug.py", line 95, in inference
    y[output_nodes] = h.cpu()
RuntimeError: CUDA error: invalid argument

Sometimes this error occurs instead:

Traceback (most recent call last):
  File "/home/ec2-user/inference_helper/bug.py", line 146, in <module>
    pred = model.inference(g, 10000, torch.device(device), feat)
  File "/home/ec2-user/inference_helper/bug.py", line 89, in inference
    h = layer(block, h)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/dgl-0.9-py3.9-linux-x86_64.egg/dgl/nn/pytorch/conv/gatconv.py", line 282, in forward
    feat_src = feat_dst = self.fc(h_src).view(
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (40736x602 and 256x256)
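
(Note: CUDA reports many errors asynchronously, so the line shown in these tracebacks is often just where the error surfaces, not where it originates. Re-running with the environment variable CUDA_LAUNCH_BLOCKING=1 usually makes the traceback point at the actual failing call.)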

Expected behavior

Environment

  • DGL Version (e.g., 1.0): 0.9
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.10
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): source
  • Build command you used (if compiling from source):
  • Python version: 3.9.4
  • CUDA/cuDNN version (if applicable): 11.3
  • GPU models and configuration (e.g. V100): Tesla T4
  • Any other relevant information: g4dn.8xlarge instance

Additional context

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

1 reaction
yinpeiqi commented, Apr 12, 2022

Thanks! I think you are right. I removed all ndata/edata before pinning the graph, and it works without that error. I tried it many times and the error doesn't happen anymore.

# drop all node/edge data so that only the graph structure gets pinned
for k in list(g.ndata.keys()):
    g.ndata.pop(k)
for k in list(g.edata.keys()):
    g.edata.pop(k)
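
In the repro script above, this cleanup would presumably go before the graph is first pinned, i.e. before creating a dataloader with use_uva=True, so that stale pinned ndata/edata cannot collide with the next pinning attempt.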

0 reactions
yaox12 commented, Apr 12, 2022

I ran your script several times and found that it sometimes fails while pinning the graph, raising the error CUDA: part or all of the requested memory range is already mapped. When you set use_uva=True in the dataloader, it pins the graph as well as the ndata/edata. I guess this error is caused by trying to pin the graph and ndata/edata while they have not yet been completely unpinned. That's why removing g.unpin_memory_() works, I think. After removing it, I ran your script 10 times and it didn't throw an error. If you still have the issue, you can try adding torch.cuda.synchronize() below unpin_memory_inplace(x).
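
For reference, here is a minimal sketch of that last suggestion. The unpin_memory_inplace(x) call refers to code that is not shown in the posted script, and importing these helpers from dgl.utils is an assumption for this DGL version:

import torch
from dgl.utils import pin_memory_inplace, unpin_memory_inplace

x = torch.randn(1000, 602)  # stand-in for the Reddit feature tensor

pin_memory_inplace(x)    # pin the tensor for zero-copy (UVA) access
# ... run inference with a use_uva=True dataloader ...
unpin_memory_inplace(x)
# the unpin may still be in flight; synchronize so it has fully
# completed before anything tries to pin the same memory range again
torch.cuda.synchronize()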

Read more comments on GitHub >
