Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might look while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

CUDA Error preventing GPU usage until runtime restarted

See original GitHub issue
  • Describe the current behavior:

Colab doesn’t throw an error when a CUDA operation first fails; instead, it throws a CUDA error on the following run and prevents any further use of CUDA until the runtime is restarted.

RuntimeError: CUDA error: an illegal memory access was encountered
  • Describe the expected behavior:

Colab should throw the CUDA error when it first occurs due to the device mismatch, and then reset CUDA’s internal state so that future usage isn’t blocked (see the sketch below for one way to surface the error at the failing call). Edit: just realized that this is basically the GPU crashing, not an “internal state” issue.
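One general point worth noting here (standard CUDA behavior, not something confirmed in this specific notebook): kernel launches are asynchronous, so an illegal memory access often only surfaces on a later, unrelated call, which may explain part of the delayed reporting. A minimal sketch of forcing synchronous launches so the error is raised at the faulting line:

import os
# Must be set before CUDA is initialized in the process, or it has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

x = torch.randn(8, device="cuda")
# ... run the failing code here; with blocking launches, the RuntimeError
# is raised at the call that actually faulted rather than on a later one.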

  • The web browser you are using (Chrome, Firefox, Safari, etc.):
Browser: Google Chrome 84.0.4147.105 (Official Build) (64-bit)
OS: macOS 10.15.5
JavaScript: V8 8.4.371.22
  • Link:

https://colab.research.google.com/drive/1TxAkgCUaFSfy7Cja7aJCie6ziAhmO1RG?usp=sharing

  • Further Explanation:

I’m using PyTorch and creating a Linear layer in the forward function rather than in __init__, which is bad practice and shouldn’t work. That isn’t the problem, though. The problem is that when I pass a tensor located on CUDA through this Linear layer (which I believe lives on the CPU, because it isn’t declared as a member of the PyTorch module), it seems to break some internal state of CUDA without throwing an error. Then, when I run the second cell a second time, it throws an error on moving a tensor to CUDA for no apparent reason (a minimal sketch of the pattern follows below). Note that GPU acceleration should be on for all of this.
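For reference, here is a minimal sketch of the pattern described above (module and dimension names are illustrative, not taken from the linked notebook):

import torch
import torch.nn as nn

class BadModel(nn.Module):
    def forward(self, x):
        # Anti-pattern: the layer is constructed inside forward(), so it is
        # never registered as a submodule and model.cuda() never moves its
        # weights off the CPU.
        layer = nn.Linear(x.shape[-1], 10)
        return layer(x)  # CUDA input vs. CPU weights: a device mismatch

class FixedModel(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        # Registered in __init__, so model.cuda() moves the weights too.
        self.layer = nn.Linear(in_features, 10)

    def forward(self, x):
        return self.layer(x)

model = BadModel().cuda()
x = torch.randn(4, 32, device="cuda")
out = model(x)  # expected to fail; on the PyTorch 1.6 Colab runtime the
                # reporter saw this silently corrupt the CUDA context
                # instead of raising

Recent PyTorch releases typically raise a clear “Expected all tensors to be on the same device” error for this mismatch, which is essentially the behavior the report asks for.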

I tried narrowing this bug down a lot, but please let me know if I’m missing something. I also thought about posting this report to PyTorch, but I’m quite sure it’s an issue with how Colab is handling hardware errors (maybe in some low-level C library).

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

2 reactions
Nepoh commented, Sep 22, 2020

@KhoomeiK I had no luck with restarting the runtime. But installing a previous version of PyTorch did the trick (the original version was 1.6.0+cu101; I installed 1.5.1+cu101). Just in case someone is looking for this:

!pip uninstall -y torch torchvision
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
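(The +cu101 suffix pins wheels built against CUDA 10.1, which matched the Colab runtime at the time; anyone trying this later would need wheels matching their runtime’s CUDA version.)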
1 reaction
JonathanSum commented, Aug 13, 2020

@KhoomeiK Yes, it works. I have to restart the runtime.

Read more comments on GitHub >

Top Results From Across the Web

Resetting GPU and driver after CUDA error - Stack Overflow
I have tried to reset the GPU and driver by changing the desktop resolution, but that doesn't help. The only fix I have...

Solving "CUDA out of memory" Error - Kaggle
If you try to train multiple models on GPU, you are most likely to encounter some error similar to this one: RuntimeError: CUDA...

Cuda Error #4 that requires PC Reboot, Help!!!
Hi! So I have a CUDA application (multiple different kernels) that runs on a Quadro 6000 with CUDA driver 320.00 running on TCC...

CUDA Error: Device-Side Assert Triggered: Solved | Built In
Restart your kernel session fresh. Every variable should reset, and you should have a brand new GPU session.

Jupyter+pytorch, or cuda memory help: stop notebook mid ...
Again, restarting the notebook (Kernel, Restart) seems to be the ONLY thing that "fixes" a cuda out of memory error...
