
Using GPU causes cuDNN error

See original GitHub issue

🐛 Bug

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [100, 113, 10]], which is output 0 of CudnnRnnBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
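As the error hint suggests, the first debugging step is to enable autograd anomaly detection, which augments the RuntimeError's traceback with the forward-pass operation that produced the offending tensor. A minimal sketch (anomaly detection slows training noticeably, so enable it only while debugging):

```python
import torch

# Globally enable anomaly detection, per the hint in the error message.
# With this on, the backward-pass RuntimeError also prints a traceback of
# the forward-pass op whose saved tensor was modified in place.
torch.autograd.set_detect_anomaly(True)
```

It can also be used as a context manager (`with torch.autograd.set_detect_anomaly(True): ...`) to limit the overhead to a single training step.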

To Reproduce

I can run my code on CPU, but switching to CUDA causes the error above.
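The issue's model isn't shown, but the error class itself is easy to reproduce in isolation: any op that saves its output for the backward pass (e.g. `exp`) will fail exactly like this if that output is then modified in place, because the in-place op bumps the tensor's version counter past the version autograd recorded. A minimal sketch (not the issue's code):

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x.exp()      # ExpBackward saves its output y for the backward pass
y.add_(1)        # in-place op bumps y's version counter (0 -> 1)
try:
    y.sum().backward()   # autograd sees version 1, expected version 0
    raised = False
except RuntimeError as e:
    raised = True        # "...modified by an inplace operation..."
```

On GPU the error names `CudnnRnnBackward0` because the cuDNN RNN kernel saves its tensors as one fused op, whereas the CPU path decomposes the RNN differently, which is why the same model can appear to train fine on CPU.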

Expected behavior

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): 1.6.0
  • PyTorch Version (e.g., 1.10): 1.11.0
  • Python version (e.g., 3.9): 3.8
  • OS (e.g., Linux): Linux centos
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: GPU 2080 ti
  • How you installed PyTorch (conda, pip, source):
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
kimmy966 commented, Apr 7, 2022

Is it possible that, in your logic, a layer has already been modified by the optimizer step, but there is another call to loss.backward that relies on the old weights to compute the gradients?
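That failure mode is easy to demonstrate in isolation: `optimizer.step()` updates parameters in place, so a second `backward()` through the same graph finds the saved weight at a newer version. A hypothetical sketch of that pattern (not the issue's code):

```python
import torch
from torch import nn

lin = nn.Linear(3, 1)
opt = torch.optim.SGD(lin.parameters(), lr=0.1)

x = torch.randn(2, 3, requires_grad=True)
loss = lin(x).sum()

loss.backward(retain_graph=True)  # first backward: fine
opt.step()                        # updates lin.weight in place

try:
    loss.backward()               # saved weight is now at a newer version
    raised = False
except RuntimeError:
    raised = True                 # same "modified by an inplace operation" error
```

The fix for this pattern is to finish all `backward()` calls before calling `step()`, or to rebuild the graph with a fresh forward pass after each step.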

Or try this if you have ReLU layers: NVlabs/FUNIT#23 (comment)

Thank you, setting the ReLU layers' inplace param to False fixed my model!
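The fix referenced above can be applied without editing the model definition. This is a hedged sketch with a hypothetical helper (not from the issue) that flips every `nn.ReLU` in a model to the out-of-place variant:

```python
from torch import nn

def disable_inplace_relu(model: nn.Module) -> None:
    """Switch every ReLU in the model to out-of-place activation."""
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.inplace = False  # ReLU then allocates a new tensor instead of overwriting its input

# Example model with an in-place ReLU, the kind that can trip autograd.
net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(inplace=True), nn.Linear(8, 1))
disable_inplace_relu(net)
```

`inplace=True` saves memory by overwriting the activation's input, but if autograd saved that same tensor for the backward pass, the overwrite triggers exactly this version-counter error.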

0 reactions
rohitgr7 commented, Apr 7, 2022

Yes, I have ReLU in my layers, but I don’t understand why I can train this model on CPU without any error occurring.

We’d need to investigate the actual issue. I’d guess it’s related to PyTorch itself, but I’m still curious.


Top Results From Across the Web

  • F.conv2d() causes RuntimeError: cuDNN error: ...
    It looks like there is a bug in CUDNN v8 for Titan X class (and maybe other classes of GPU). The version of...
  • Cudnn Error in initializeCommonContext - TensorRT
    Description. Hi, I met a problem when I tried to deserialize a TensorRT engine and create the context. The system threw an Error...
  • RuntimeError: cuDNN error
    I am running this code in a computer with rtx 3090ti. However, the code raises an error with first forward layer.
  • RuntimeError: cuDNN error
    If it is not that your model/data is too big then it is because your GPU has not freed the memory. Go to...
  • Memory Management and Using Multiple GPUs
    If you just call cuda, then the tensor is placed on GPU 0. The torch.nn. ... If operands are on different devices,...
