How do I fix the RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED? Thank you!
INTERFACE:
config yaml: config/cityscapes/darknet21_aspp.yaml
log dir /home/pc/logs/2019-8-20-16:13/
model path None
eval only False
No batchnorm False
----------
Commit hash (training version): b'5368eed'
----------
Opening config file config/cityscapes/darknet21_aspp.yaml
No pretrained directory found.
Copying files to /home/pc/logs/2019-8-20-16:13/ for further reference.
WARNING: Logging before flag parsing goes to stderr.
W0820 16:13:16.396194 140436803987200 deprecation_wrapper.py:119] From ../../common/logger.py:16: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
Images from: /home3/data/city/city_selected/leftImg8bit/train
Labels from: /home3/data/city/city_selected/gtFine/train
Inference batch size: 1
Images from: /home3/data/city/city_selected/leftImg8bit/val
Labels from: /home3/data/city/city_selected/gtFine/val
Original OS: 32
New OS: 8
Strides: [2, 2, 2, 1, 1]
Dilations: [1, 1, 1, 2, 4]
Trying to get backbone weights online from Bonnetal server.
Using pretrained weights from bonnetal server for backbone
[Decoder] os: 4 in: 128 skip: 128 out: 128
[Decoder] os: 2 in: 128 skip: 64 out: 64
[Decoder] os: 1 in: 64 skip: 32 out: 32
Using normalized weights as bias for head.
No path to pretrained, using bonnetal Imagenet backbone weights and random decoder.
Total number of parameters: 19239412
Total number of parameters requires_grad: 19239412
Param encoder 14920544
Param decoder 4318208
Param head 660
Training in device: cuda
Ignoring class 19 in IoU evaluation
[IOU EVAL] IGNORE: tensor([19])
[IOU EVAL] INCLUDE: tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18])
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [3,0,0], thread: [576,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [352,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [353,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [354,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [2,0,0], thread: [355,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "train.py", line 117, in <module>
trainer.train()
File "../../tasks/segmentation/modules/trainer.py", line 302, in train
scheduler=self.scheduler)
File "../../tasks/segmentation/modules/trainer.py", line 487, in train_epoch
loss.backward()
File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.5/dist-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
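The repeated device-side asserts above (t >= 0 && t < n_classes), rather than the final cuDNN error, are the actual failure: the target tensor handed to the spatial NLL loss contains class indices outside the valid range, and because CUDA kernels run asynchronously the crash only surfaces later, here inside backward(). A minimal sketch that reproduces the same assert (placeholder sizes, not code from this repository; running with CUDA_LAUNCH_BLOCKING=1 makes the assert appear at the offending call instead):

```python
# Hedged sketch, not taken from bonnetal: cross_entropy on CUDA asserts when a
# target value falls outside [0, n_classes); the failure then shows up as an
# unrelated-looking CUDA/cuDNN RuntimeError at a later call such as backward().
import torch
import torch.nn.functional as F

n_classes = 20
logits = torch.randn(1, n_classes, 4, 4, device="cuda", requires_grad=True)
# 255 >= n_classes, so this target is invalid for a 20-class loss
target = torch.full((1, 4, 4), 255, dtype=torch.long, device="cuda")

loss = F.cross_entropy(logits, target)  # triggers "t >= 0 && t < n_classes"
loss.backward()
```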

Hi, I updated your comment to make it easier to read. The error occurs because your labels are out of range. The Cityscapes data needs to be preprocessed before use, to put all labels into the 0-19 range, using their API, which you can access here. The mapping for each label is defined by the user and can be found in this script of their API. I usually replace the trainIds 255 and -1 with 19 to get a consistent label set that cross-entropy can handle.
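For reference, a minimal sketch of that remapping, assuming the official cityscapesscripts package is installed (the function name and file paths are placeholders, not part of bonnetal):

```python
# Hedged sketch: remap raw Cityscapes label ids to trainIds and fold the ignore
# values (255 and -1) into class 19, as described above.
import numpy as np
from PIL import Image
from cityscapesscripts.helpers.labels import id2label  # official Cityscapes API

def to_train_ids(label_path, out_path, ignore_class=19):
    label_ids = np.array(Image.open(label_path), dtype=np.int32)
    train_ids = np.full_like(label_ids, ignore_class)
    for label_id, label in id2label.items():
        train_id = label.trainId
        if train_id in (255, -1):
            train_id = ignore_class  # keep the label set dense for cross-entropy
        train_ids[label_ids == label_id] = train_id
    Image.fromarray(train_ids.astype(np.uint8)).save(out_path)
```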
@tano297 @newforrestgump001 I am getting the same error but do not seem to be able to solve it. I have changed the labels and preprocessed the label files (edited labels.py and ran python createTrainIdLabelImgs.py), but the code still exits before completing, at File "../../tasks/segmentation/modules/trainer.py", line 488, in train_epoch, loss.backward(). Do you have any idea what I could do to solve this issue?
My labels.py file in cityscapes:
Traceback:
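One quick way to narrow this down is to check the regenerated *_labelTrainIds.png files directly; a minimal sketch, assuming the standard cityscapesscripts output naming under the dataset path from the log above (the glob pattern and n_classes = 20 are assumptions, not taken from the repo config):

```python
# Hedged sketch: scan the regenerated trainId label images and report any values
# that would still trip the "t >= 0 && t < n_classes" assert.
import glob
import numpy as np
from PIL import Image

n_classes = 20
pattern = "/home3/data/city/city_selected/gtFine/train/*/*_labelTrainIds.png"
for path in glob.glob(pattern):
    values = np.unique(np.array(Image.open(path)))  # uint8, so only the upper bound matters
    bad = values[values >= n_classes]
    if bad.size > 0:
        print(path, "still contains out-of-range trainIds:", bad)
```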