`CUDA error: unspecified launch failure`, similar to #3802
🐛 Bug
I am seeing the same issue that was reported as fixed in #3841 in the latest 0.9.0 (and in everything down to releases below 0.8.0). See #3802 for more context. As previously reported by @wsjeon, I am seeing this issue when using DGL with PyTorch Lightning, though I haven't tried to reproduce the problem without that package.
Tagging @BarclayII and @nv-dlasalle who previously investigated this.
To Reproduce
Steps to reproduce the behavior:
- Set up the environment:
conda config --env --add channels dglteam
conda config --env --add channels pytorch
conda install dgl-cuda11.3 pytorch-lightning cudatoolkit=11.3 pytorch=1.12.1
- Run this:
import torch
import dgl
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 10)

    def training_step(self, batch, batch_nb):
        return torch.tensor(2)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

class MyDataset(torch.utils.data.Dataset):
    def __init__(self):
        super().__init__()

    def __len__(self):
        return 10

    def __getitem__(self, idx):
        g = dgl.graph(data=([0, 1], [1, 0]), num_nodes=2)
        return g, torch.tensor([0])

def collate_graphs(samples):
    graphs = [x[0] for x in samples]
    batched_graph = dgl.batch(graphs)
    targets = torch.cat([x[1] for x in samples])
    return batched_graph, targets

loader = torch.utils.data.DataLoader(
    dataset=MyDataset(), batch_size=2, num_workers=2, collate_fn=collate_graphs
)

model = MyModel()
trainer = pl.Trainer(
    strategy='ddp',
    accelerator='gpu',
    devices=[0],
    fast_dev_run=True,
)
trainer.fit(model, loader)
Stack trace:
Epoch 0: 0%| | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 171, in advance
batch = next(data_fetcher)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
return self.fetching_function()
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 269, in fetching_function
return self.move_to_device(batch)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 284, in move_to_device
batch = self.batch_to_device(batch)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
output = fn(*args, **kwargs)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in batch_to_device
return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/core/lightning.py", line 291, in _apply_batch_transfer_handler
batch = hook(batch, device, dataloader_idx)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/core/hooks.py", line 713, in transfer_batch_to_device
return move_data_to_device(batch, device)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/utilities/apply_func.py", line 354, in move_data_to_device
return apply_to_collection(batch, dtype=dtype, function=batch_to)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/utilities/apply_func.py", line 121, in apply_to_collection
v = apply_to_collection(
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
return function(data, *args, **kwargs)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/utilities/apply_func.py", line 347, in batch_to
data_output = data.to(device, **kwargs)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/dgl/heterograph.py", line 5448, in to
ret._graph = self._graph.copy_to(utils.to_dgl_context(device))
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/dgl/heterograph_index.py", line 236, in copy_to
return _CAPI_DGLHeteroCopyTo(self, ctx.device_type, ctx.device_id)
File "dgl/_ffi/_cython/./function.pxi", line 293, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 225, in dgl._ffi._cy3.core.FuncCall
File "dgl/_ffi/_cython/./function.pxi", line 215, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [12:47:27] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:114: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: unspecified launch failure
Stack trace:
[bt] (0) miniconda3/envs/dgl-test/lib/python3.10/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f51d135fd6f]
[bt] (1) miniconda3/envs/dgl-test/lib/python3.10/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::AllocDataSpace(DLContext, unsigned long, unsigned long, DLDataType)+0x108) [0x7f51d183a4a8]
[bt] (2) miniconda3/envs/dgl-test/lib/python3.10/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::Empty(std::vector<long, std::allocator<long> >, DLDataType, DLContext)+0x361) [0x7f51d16ac5d1]
[bt] (3) miniconda3/envs/dgl-test/lib/python3.10/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::CopyTo(DLContext const&, void* const&) const+0xc7) [0x7f51d16e8bb7]
[bt] (4) miniconda3/envs/dgl-test/lib/python3.10/site-packages/dgl/libdgl.so(dgl::UnitGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DLContext const&, void* const&)+0x317) [0x7f51d17f9db7]
[bt] (5) miniconda3/envs/dgl-test/lib/python3.10/site-packages/dgl/libdgl.so(dgl::HeteroGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DLContext const&, void* const&)+0x109) [0x7f51d16fa939]
[bt] (6) miniconda3/envs/dgl-test/lib/python3.10/site-packages/dgl/libdgl.so(+0x73b9c9) [0x7f51d17079c9]
[bt] (7) miniconda3/envs/dgl-test/lib/python3.10/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f51d168a928]
[bt] (8) miniconda3/envs/dgl-test/lib/python3.10/site-packages/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x16143) [0x7f51f4995143]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "gnn-tagger/GNNJetTagger/gnn_tagger/training/minimal.py", line 43, in <module>
trainer.fit(model, loader)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 738, in _call_and_handle_interrupt
self._teardown()
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1300, in _teardown
self.strategy.teardown()
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 482, in teardown
self.lightning_module.cpu()
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 147, in cpu
return super().cpu()
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 738, in cpu
return self._apply(lambda t: t.cpu())
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "miniconda3/envs/dgl-test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 738, in <lambda>
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: unspecified launch failure
Expected behavior
Environment
- DGL Version (e.g., 1.0): dgl-cuda11.3 0.9.0 py310_0
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): pytorch 1.12.1 py3.10_cuda11.3_cudnn8.3.2_0
- OS (e.g., Linux): Linux
- How you installed DGL (conda, pip, source): conda
- Build command you used (if compiling from source): NA
- Python version: 3.10
- CUDA/cuDNN version (if applicable): 11.3
- GPU models and configuration (e.g. V100): GeForce RTX 2080
- Any other relevant information:

@BarclayII Cannot repro with the GraphSAGE example and dgl 0.9.0. Multi-worker CPU sampling with a CUDA dataloader device should now be covered by the unit test: https://github.com/dmlc/dgl/blob/5ba5106acab6a642e9b790e5331ee519112a5623/tests/pytorch/test_dataloader.py#L185-L187
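For reference, a minimal sketch of the pattern that test exercises (CPU graph, multi-worker sampling, CUDA output device), assuming the dgl.dataloading.DataLoader API available since 0.8; the random graph, sampler fanouts, and batch size below are made up for illustration:

import torch
import dgl

g = dgl.rand_graph(1000, 5000)                     # random homogeneous graph, kept on CPU
sampler = dgl.dataloading.NeighborSampler([5, 5])  # 2-layer neighbor sampling
loader = dgl.dataloading.DataLoader(
    g,
    torch.arange(g.num_nodes()),
    sampler,
    device='cuda:0',   # sampled blocks are copied to the GPU
    batch_size=64,
    shuffle=True,
    num_workers=2,     # sampling runs in CPU worker processes
)
for input_nodes, output_nodes, blocks in loader:
    assert blocks[0].device == torch.device('cuda:0')
    break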
@samvanstroud Are you using PyTorch 1.12.1? I don't think DGL has released PyTorch 1.12.1 support. Can you try PyTorch 1.12.0?
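For example, something like this in the same environment (the exact package spec is an assumption and may need adjusting; the pytorch channel is already configured above):
conda install pytorch=1.12.0 cudatoolkit=11.3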
@mufeili I can reproduce this issue with PyTorch 1.12.1, but haven't found the root cause. Judging from the error message, it does not seem related to the tensoradapter, so I'm not sure what change in PyTorch 1.12.1 breaks it. I'll try building from source with PyTorch 1.12.1 and see if the error goes away.
Update: The error disappears when building DGL from source against PyTorch 1.12.1.
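For anyone who wants to try the same workaround, the documented source build goes roughly like this (a sketch assuming a CUDA build; flags, paths, and the -j value may need adjusting for your machine):

git clone --recursive https://github.com/dmlc/dgl.git
cd dgl
mkdir build && cd build
cmake -DUSE_CUDA=ON ..    # enable the CUDA backend
make -j4
cd ../python
python setup.py install   # install the Python package against the freshly built libdgl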