When using UVA, a CUDA error occurs at tensor.to(device)
🐛 Bug
When trying to perform inference using UVA, several different errors occur. I ran the code several times, and each time I got a different result.
To Reproduce
import gc

import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.data import RedditDataset
from dgl.nn import GATConv
class GAT(nn.Module):
    def __init__(self,
                 num_layers,
                 in_dim,
                 num_hidden,
                 num_classes,
                 heads,
                 activation,
                 feat_drop,
                 attn_drop,
                 negative_slope,
                 residual):
        super(GAT, self).__init__()
        self.num_layers = num_layers
        self.gat_layers = nn.ModuleList()
        self.activation = activation
        self.hidden_features = num_hidden
        self.heads = heads
        self.out_features = num_classes
        # input projection (no residual)
        self.gat_layers.append(GATConv(
            in_dim, num_hidden, heads[0],
            feat_drop, attn_drop, negative_slope, False,
            self.activation, allow_zero_in_degree=True))
        # hidden layers
        for l in range(1, num_layers - 1):
            # due to multi-head, the in_dim = num_hidden * num_heads
            self.gat_layers.append(GATConv(
                num_hidden * heads[l - 1], num_hidden, heads[l],
                feat_drop, attn_drop, negative_slope, residual,
                self.activation, allow_zero_in_degree=True))
        # output projection
        self.gat_layers.append(GATConv(
            num_hidden * heads[-2], num_classes, heads[-1],
            feat_drop, attn_drop, negative_slope, residual,
            None, allow_zero_in_degree=True))

    def forward(self, g, inputs):
        h = inputs
        for l in range(self.num_layers - 1):
            h = self.gat_layers[l](g[l], h).flatten(1)
        # output projection
        logits = self.gat_layers[-1](g[-1], h).mean(1)
        return logits

    def forward_full(self, g, inputs):
        h = inputs
        for l in range(self.num_layers - 1):
            h = self.gat_layers[l](g, h).flatten(1)
        # output projection
        logits = self.gat_layers[-1](g, h).mean(1)
        return logits

    def inference(self, g, batch_size, device, x):
        torch.cuda.reset_peak_memory_stats()
        for l, layer in enumerate(self.gat_layers):
            gc.collect()
            torch.cuda.empty_cache()
            if l != self.num_layers - 1:
                y = torch.zeros(g.number_of_nodes(),
                                self.heads[l] * self.hidden_features)
            else:
                y = torch.zeros(g.number_of_nodes(), self.out_features)
            g.ndata['feat'] = x
            sampler = dgl.dataloading.MultiLayerFullNeighborSampler(
                1, prefetch_node_feats=['feat'])
            dataloader = dgl.dataloading.NodeDataLoader(
                g, torch.arange(g.number_of_nodes()).to(device), sampler,
                batch_size=batch_size,
                shuffle=False,
                drop_last=False,
                use_uva=True,
                device=device,
                num_workers=0)
            for input_nodes, output_nodes, blocks in dataloader:
                torch.cuda.reset_peak_memory_stats()
                torch.cuda.empty_cache()
                block = blocks[0].to(device)
                h = block.srcdata['feat']
                h = h.to(device)
                h = layer(block, h)
                if l == self.num_layers - 1:
                    logits = h.mean(1)
                    y[output_nodes] = logits.cpu()
                else:
                    h = h.flatten(1)
                    y[output_nodes] = h.cpu()
            # this layer's output becomes the next layer's input
            x = y
        return y

def load_reddit():
    data = RedditDataset(self_loop=True)
    g = data[0]
    g.ndata['features'] = g.ndata['feat']
    return g, data.num_classes


if __name__ == '__main__':
    dataset = load_reddit()
    g: dgl.DGLHeteroGraph = dataset[0]
    train_mask = g.ndata['train_mask']
    val_mask = g.ndata['val_mask']
    test_mask = g.ndata['test_mask']
    feat = g.ndata['feat']
    labels = g.ndata['label']
    num_classes = dataset[1]
    in_feats = feat.shape[1]
    train_nid = torch.nonzero(train_mask, as_tuple=True)[0]
    hidden_feature = 128
    sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25, 50])
    dataloader = dgl.dataloading.NodeDataLoader(
        g, train_nid, sampler,
        batch_size=2000,
        shuffle=True,
        drop_last=False,
        num_workers=4)
    model = GAT(3, in_feats, hidden_feature, num_classes, [2, 2, 2],
                F.relu, 0.5, 0.5, 0.5, 0.5)
    device = "cuda:0"
    model = model.to(torch.device(device))
    opt = torch.optim.Adam(model.parameters())
    loss_fcn = nn.CrossEntropyLoss()
    for epoch in range(1):
        for input_nodes, output_nodes, blocks in dataloader:
            blocks = [b.to(torch.device(device)) for b in blocks]
            input_features = feat[input_nodes].to(torch.device(device))
            pred = model(blocks, input_features)
            output_labels = labels[output_nodes].to(torch.device(device))
            loss = loss_fcn(pred, output_labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
            break
    with torch.no_grad():
        pred = model.inference(g, 10000, torch.device(device), feat)
    func_score = (torch.argmax(pred, dim=1) == labels).float().sum() / len(pred)
Steps to reproduce the behavior:
Sometimes I can run the code successfully, but sometimes an error occurs. This is the most common error I met; it happens at h.cpu() or h.to(device):
Traceback (most recent call last):
File "/home/ec2-user/inference_helper/bug.py", line 146, in <module>
pred = model.inference(g, 10000, torch.device(device), feat)
File "/home/ec2-user/inference_helper/bug.py", line 95, in inference
y[output_nodes] = h.cpu()
RuntimeError: CUDA error: invalid argument
Sometimes this error occurs instead; the shapes suggest the 602-dimensional raw Reddit features are reaching a hidden layer whose linear projection expects 256 = 128 hidden units × 2 heads:
Traceback (most recent call last):
File "/home/ec2-user/inference_helper/bug.py", line 146, in <module>
pred = model.inference(g, 10000, torch.device(device), feat)
File "/home/ec2-user/inference_helper/bug.py", line 89, in inference
h = layer(block, h)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/dgl-0.9-py3.9-linux-x86_64.egg/dgl/nn/pytorch/conv/gatconv.py", line 282, in forward
feat_src = feat_dst = self.fc(h_src).view(
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (40736x602 and 256x256)
Expected behavior
Environment
- DGL Version (e.g., 1.0): 0.9
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.10
- OS (e.g., Linux): Linux
- How you installed DGL (conda, pip, source): source
- Build command you used (if compiling from source):
- Python version: 3.9.4
- CUDA/cuDNN version (if applicable): 11.3
- GPU models and configuration (e.g. V100): Tesla T4
- Any other relevant information: g4dn.8xlarge instance
Additional context
Thanks! I think you are right. I removed all ndata/edata before pinning the graph and it works without that error. I tried it many times and the error does not seem to happen anymore.
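A minimal sketch of that workaround, assuming the inference loop only reads the feat tensor and that dropping the other keys the Reddit loader attaches (train_mask, label, etc.) is acceptable; the cleanup loop itself is hypothetical, not from the original script:

# Drop every ndata/edata entry except the one feature the UVA dataloader
# must pin, so use_uva=True does not try to pin stale tensors.
for key in list(g.ndata.keys()):
    if key != 'feat':
        g.ndata.pop(key)
for key in list(g.edata.keys()):
    g.edata.pop(key)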
I ran your script several times and found that it sometimes fails while pinning the graph, raising the error CUDA: part or all of the requested memory range is already mapped. When you set use_uva=True in the dataloader, it pins the graph as well as the ndata/edata. I guess this error is caused by trying to pin the graph and ndata/edata before they have been completely unpinned. That's why removing g.unpin_memory_() works, I think. After removing it, I ran your script 10 times and it didn't throw an error. If you still have the issue, you can try adding torch.cuda.synchronize() below unpin_memory_inplace(x).
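A minimal sketch of that suggestion, applied to the per-layer loop of the inference method in the script above; the import path dgl.utils for unpin_memory_inplace is an assumption about this DGL build, and the loop structure is reconstructed, not verbatim:

import torch
from dgl.utils import unpin_memory_inplace  # assumed location in this DGL version

for l, layer in enumerate(model.gat_layers):
    ...  # build the UVA dataloader and compute y for this layer as above
    unpin_memory_inplace(x)  # release the pinned input feature tensor
    # Wait for all outstanding CUDA work (including the unpin) to finish,
    # so the next iteration does not pin a memory range that is still mapped.
    torch.cuda.synchronize()
    x = y  # this layer's output feeds the next layer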