CUDA Error preventing GPU usage until runtime restarted
- Describe the current behavior:
Colab doesn’t throw an error when a CUDA operation first fails. Instead, it throws a CUDA error on the following run and prevents any further usage of CUDA until the runtime is restarted.
RuntimeError: CUDA error: an illegal memory access was encountered
- Describe the expected behavior:
Colab should throw the CUDA error when the error first appears due to a device mismatch, and then reset CUDA’s internal state so that future usage isn’t blocked. Edit: just realized that this is basically the GPU crashing, not an “internal state” issue.
- The web browser you are using (Chrome, Firefox, Safari, etc.):
Google Chrome | 84.0.4147.105 (Official Build) (64-bit)
OS | macOS Version 10.15.5
JavaScript | V8 8.4.371.22
- Link:
https://colab.research.google.com/drive/1TxAkgCUaFSfy7Cja7aJCie6ziAhmO1RG?usp=sharing
- Further Explanation:
I’m using PyTorch and creating a Linear layer in the forward function, rather than the init function, which is totally bad practice and shouldn’t work. That isn’t the problem though. The problem is that when I pass a tensor (which is located on CUDA) through this Linear layer (which I believe is located on CPU, because it isn’t declared a member of the PyTorch module), it seems to break some internal state of CUDA without throwing an error. Then when I run the second cell the second time, it throws an error on moving a tensor to CUDA for no apparent reason. By the way, GPU acceleration should be on for all of this.
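For reference, here is a minimal sketch of the usual pattern, in which the layer is registered in `__init__` so that `model.to(device)` moves its weights along with the rest of the module. The class name `FixedNet` and the layer sizes are made up for illustration and are not taken from the linked notebook; the snippet falls back to CPU when no GPU is available.

```python
import torch
import torch.nn as nn


class FixedNet(nn.Module):
    """Hypothetical module: the Linear layer is created once, in __init__."""

    def __init__(self, in_features=4, out_features=2):
        super().__init__()
        # Assigning the layer as an attribute registers its parameters,
        # so .to(device) moves them together with the module.
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        # Creating nn.Linear(...) here instead would build a fresh,
        # unregistered layer on the default device (CPU) on every call.
        return self.linear(x)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = FixedNet().to(device)

x = torch.randn(3, 4, device=device)
y = model(x)

# The input, the parameters, and the output all live on the same device.
print(x.device, next(model.parameters()).device, y.device)
```

This only avoids the device mismatch; it does not explain why the mismatch surfaces as a delayed CUDA error instead of an immediate exception.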
I tried narrowing this bug down a lot, but please let me know if I’m missing something. I also thought about posting this report to PyTorch, but I’m quite sure it’s an issue with how Colab is handling hardware errors (maybe in some low-level C library).
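One thing that may be relevant here: CUDA kernels are launched asynchronously, so a failure is often reported at a later synchronization point instead of at the call that caused it. The sketch below shows two general ways to make errors surface earlier when debugging. It is an assumption that this applies to the behavior above, and the helper name `run_checked` is invented for illustration.

```python
import os

# Ask CUDA to run kernel launches synchronously. This has to be set before
# CUDA is initialized, so in a notebook it belongs in the first cell of a
# fresh runtime. setdefault() keeps any value that is already configured.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch


def run_checked(fn, *args, **kwargs):
    """Call fn and wait for the GPU, so pending CUDA errors are raised here."""
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        # Blocks until all queued kernels finish; an asynchronous failure
        # from fn is reported at this point instead of in a later cell.
        torch.cuda.synchronize()
    return out


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.ones(2, device=device)
result = run_checked(torch.add, a, a)
print(result)
```

An illegal memory access generally leaves the CUDA context of the process unusable, which would be consistent with needing a runtime restart afterwards.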
@KhoomeiK I had no luck with restarting the runtime. But installing a previous version of PyTorch did the trick (original version was 1.6.0+cu101, I installed 1.5.1+cu101). Just in case someone is looking for this:
@KhoomeiK Yes, it works. I have to restart the runtime.