When using UVA, a CUDA error occurs at tensor.to(device)
🐛 Bug
When trying to perform inference using UVA, several different errors occur. I ran the code several times, and each time I got a different result.
To Reproduce
import gc

import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.data import RedditDataset
from dgl.nn import GATConv
class GAT(nn.Module):
    def __init__(self,
                 num_layers,
                 in_dim,
                 num_hidden,
                 num_classes,
                 heads,
                 activation,
                 feat_drop,
                 attn_drop,
                 negative_slope,
                 residual):
        super(GAT, self).__init__()
        self.num_layers = num_layers
        self.gat_layers = nn.ModuleList()
        self.activation = activation
        self.hidden_features = num_hidden
        self.heads = heads
        self.out_features = num_classes
        # input projection (no residual)
        self.gat_layers.append(GATConv(
            in_dim, num_hidden, heads[0],
            feat_drop, attn_drop, negative_slope, False,
            self.activation, allow_zero_in_degree=True))
        # hidden layers
        for l in range(1, num_layers - 1):
            # due to multi-head, the in_dim = num_hidden * num_heads
            self.gat_layers.append(GATConv(
                num_hidden * heads[l - 1], num_hidden, heads[l],
                feat_drop, attn_drop, negative_slope, residual,
                self.activation, allow_zero_in_degree=True))
        # output projection
        self.gat_layers.append(GATConv(
            num_hidden * heads[-2], num_classes, heads[-1],
            feat_drop, attn_drop, negative_slope, residual,
            None, allow_zero_in_degree=True))

    def forward(self, g, inputs):
        h = inputs
        for l in range(self.num_layers - 1):
            h = self.gat_layers[l](g[l], h).flatten(1)
        # output projection
        logits = self.gat_layers[-1](g[-1], h).mean(1)
        return logits

    def forward_full(self, g, inputs):
        h = inputs
        for l in range(self.num_layers - 1):
            h = self.gat_layers[l](g, h).flatten(1)
        # output projection
        logits = self.gat_layers[-1](g, h).mean(1)
        return logits

    def inference(self, g, batch_size, device, x):
        torch.cuda.reset_peak_memory_stats()
        for l, layer in enumerate(self.gat_layers):
            gc.collect()
            torch.cuda.empty_cache()
            if l != self.num_layers - 1:
                y = torch.zeros(g.number_of_nodes(),
                                self.heads[l] * self.hidden_features)
            else:
                y = torch.zeros(g.number_of_nodes(), self.out_features)
            g.ndata['feat'] = x
            sampler = dgl.dataloading.MultiLayerFullNeighborSampler(
                1, prefetch_node_feats=['feat'])
            dataloader = dgl.dataloading.NodeDataLoader(
                g, torch.arange(g.number_of_nodes()).to(device), sampler,
                batch_size=batch_size,
                shuffle=False,
                drop_last=False,
                use_uva=True,
                device=device,
                num_workers=0)
            for input_nodes, output_nodes, blocks in dataloader:
                torch.cuda.reset_peak_memory_stats()
                torch.cuda.empty_cache()
                block = blocks[0].to(device)
                h = block.srcdata['feat']
                h = h.to(device)
                h = layer(block, h)
                if l == self.num_layers - 1:
                    logits = h.mean(1)
                    y[output_nodes] = logits.cpu()
                else:
                    h = h.flatten(1)
                    y[output_nodes] = h.cpu()
            # this layer's output becomes the next layer's input
            x = y
        return y

def load_reddit():
    data = RedditDataset(self_loop=True)
    g = data[0]
    g.ndata['features'] = g.ndata['feat']
    return g, data.num_classes


if __name__ == '__main__':
    dataset = load_reddit()
    g: dgl.DGLHeteroGraph = dataset[0]
    train_mask = g.ndata['train_mask']
    val_mask = g.ndata['val_mask']
    test_mask = g.ndata['test_mask']
    feat = g.ndata['feat']
    labels = g.ndata['label']
    num_classes = dataset[1]
    in_feats = feat.shape[1]
    train_nid = torch.nonzero(train_mask, as_tuple=True)[0]
    hidden_feature = 128
    sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25, 50])
    dataloader = dgl.dataloading.NodeDataLoader(
        g, train_nid, sampler,
        batch_size=2000,
        shuffle=True,
        drop_last=False,
        num_workers=4)
    model = GAT(3, in_feats, hidden_feature, num_classes, [2, 2, 2],
                F.relu, 0.5, 0.5, 0.5, 0.5)
    device = "cuda:0"
    model = model.to(torch.device(device))
    opt = torch.optim.Adam(model.parameters())
    loss_fcn = nn.CrossEntropyLoss()
    for epoch in range(1):
        for input_nodes, output_nodes, blocks in dataloader:
            blocks = [b.to(torch.device(device)) for b in blocks]
            input_features = feat[input_nodes].to(torch.device(device))
            pred = model(blocks, input_features)
            output_labels = labels[output_nodes].to(torch.device(device))
            loss = loss_fcn(pred, output_labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
            break
    with torch.no_grad():
        pred = model.inference(g, 10000, torch.device(device), feat)
    func_score = (torch.argmax(pred, dim=1) == labels).float().sum() / len(pred)
Steps to reproduce the behavior:
Sometimes I can run the code successfully, but sometimes an error occurs. This is the most common error I met; it happens at h.cpu() or h.to(device):
Traceback (most recent call last):
File "/home/ec2-user/inference_helper/bug.py", line 146, in <module>
pred = model.inference(g, 10000, torch.device(device), feat)
File "/home/ec2-user/inference_helper/bug.py", line 95, in inference
y[output_nodes] = h.cpu()
RuntimeError: CUDA error: invalid argument
Sometimes this error occurs instead; the shapes suggest the 602-dimensional raw Reddit features are reaching a hidden layer whose linear projection expects 256 = 128 hidden units × 2 heads:
Traceback (most recent call last):
File "/home/ec2-user/inference_helper/bug.py", line 146, in <module>
pred = model.inference(g, 10000, torch.device(device), feat)
File "/home/ec2-user/inference_helper/bug.py", line 89, in inference
h = layer(block, h)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/dgl-0.9-py3.9-linux-x86_64.egg/dgl/nn/pytorch/conv/gatconv.py", line 282, in forward
feat_src = feat_dst = self.fc(h_src).view(
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (40736x602 and 256x256)
Expected behavior
Environment
- DGL Version (e.g., 1.0): 0.9
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.10
- OS (e.g., Linux): Linux
- How you installed DGL (conda, pip, source): source
- Build command you used (if compiling from source):
- Python version: 3.9.4
- CUDA/cuDNN version (if applicable): 11.3
- GPU models and configuration (e.g. V100): Tesla T4
- Any other relevant information: g4dn.8xlarge instance
Additional context
Thanks! I think you are right. I removed all ndata/edata before pinning the graph and it works without that error. I tried it many times and the error does not seem to happen anymore.
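A minimal sketch of that workaround, assuming the inference loop only reads the feat tensor and that dropping the other keys the Reddit loader attaches (train_mask, label, etc.) is acceptable; the cleanup loop itself is hypothetical, not from the original script:

# Drop every ndata/edata entry except the one feature the UVA dataloader
# must pin, so use_uva=True does not try to pin stale tensors.
for key in list(g.ndata.keys()):
    if key != 'feat':
        g.ndata.pop(key)
for key in list(g.edata.keys()):
    g.edata.pop(key)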
I ran your script several times and found that it sometimes fails while pinning the graph, raising the error CUDA: part or all of the requested memory range is already mapped. When you set use_uva=True in the dataloader, it pins the graph as well as the ndata/edata. I guess this error is caused by trying to pin the graph and ndata/edata before they have been completely unpinned. That's why removing g.unpin_memory_() works, I think. After removing it, I ran your script 10 times and it didn't throw an error. If you still have the issue, you can try adding torch.cuda.synchronize() below unpin_memory_inplace(x).
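A minimal sketch of that suggestion, applied to the per-layer loop of the inference method in the script above; the import path dgl.utils for unpin_memory_inplace is an assumption about this DGL build, and the loop structure is reconstructed, not verbatim:

import torch
from dgl.utils import unpin_memory_inplace  # assumed location in this DGL version

for l, layer in enumerate(model.gat_layers):
    ...  # build the UVA dataloader and compute y for this layer as above
    unpin_memory_inplace(x)  # release the pinned input feature tensor
    # Wait for all outstanding CUDA work (including the unpin) to finish,
    # so the next iteration does not pin a memory range that is still mapped.
    torch.cuda.synchronize()
    x = y  # this layer's output feeds the next layer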