Illegal memory access with latest PyTorch/Apex
Hi,
I’m trying to train a model with Apex using the latest PyTorch and Apex master branches, but every forward call ends with the following error:
  File "main.py", line 348, in <module>
    metrics = train(batch, args.fp16)
  File "main.py", line 88, in train
    scaled_loss.backward()
  File "/nfs/project/mr/miniconda/envs/machine_reading/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/nfs/project/mr/miniconda/envs/machine_reading/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
The full log with the CUDNN_LOGINFO_DBG flag set is attached below. The setup uses cuDNN 7.3.1 / CUDA 9.2.
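For anyone trying to capture a similar log, here is a minimal sketch of building an environment with cuDNN API logging switched on. CUDNN_LOGINFO_DBG and CUDNN_LOGDEST_DBG are cuDNN's logging variables, and CUDA_LAUNCH_BLOCKING makes kernel launches synchronous so the Python traceback points closer to the failing call. The helper name debug_env, the log file name, and the commented-out launch command are assumptions for illustration and were not part of the original report.

```python
import os


def debug_env(log_path="cudnn_api.log", base=None):
    """Return a copy of the environment with cuDNN API logging enabled.

    `log_path` is a hypothetical file name; cuDNN also accepts
    "stdout" or "stderr" as the log destination.
    """
    env = dict(os.environ if base is None else base)
    env["CUDNN_LOGINFO_DBG"] = "1"        # turn on cuDNN informational logging
    env["CUDNN_LOGDEST_DBG"] = log_path   # where cuDNN writes the log
    env["CUDA_LAUNCH_BLOCKING"] = "1"     # synchronous launches for clearer tracebacks
    return env


if __name__ == "__main__":
    env = debug_env()
    # Hypothetical launch of the training script with the debug environment:
    # import subprocess, sys
    # subprocess.run([sys.executable, "main.py"], env=env, check=True)
    for key in ("CUDNN_LOGINFO_DBG", "CUDNN_LOGDEST_DBG", "CUDA_LAUNCH_BLOCKING"):
        print(key, "=", env[key])
```

The variables have to be set before the process loads cuDNN, which is why the sketch prepares an environment for a child process instead of changing os.environ after torch has been imported.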
Issue Analytics
- State:
- Created 5 years ago
- Comments:9 (4 by maintainers)
Side note: recompiling PyTorch after upgrading to cuDNN 7.4.1 on the desktop with the 1080 Ti fixed the issue. I’m rebuilding a PyTorch environment linked against 7.4.1 to validate on V100 cards.
I’ll let you know ASAP. Morgan
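After a rebuild like this, it can help to confirm which cuDNN the PyTorch binary actually reports. The sketch below is one way to do that, not the procedure used in this thread. torch.version.cuda and torch.backends.cudnn.version() are existing PyTorch attributes; the helper names decode_cudnn_version and report are made up for this example, and the decoding assumes the cuDNN 7.x/8.x integer scheme (major*1000 + minor*100 + patch).

```python
def decode_cudnn_version(v):
    """Split a cuDNN 7.x/8.x version integer into (major, minor, patch).

    Assumes the encoding major*1000 + minor*100 + patch, e.g. 7401 for 7.4.1.
    """
    major, rest = divmod(v, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def report():
    """Print the CUDA and cuDNN versions reported by PyTorch, if installed."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("CUDA (build):", torch.version.cuda)
    cudnn = torch.backends.cudnn.version()  # None when cuDNN is unavailable
    if cudnn is None:
        print("cuDNN: not available")
    else:
        print("cuDNN:", ".".join(str(part) for part in decode_cudnn_version(cudnn)))


if __name__ == "__main__":
    report()
```

If the reported cuDNN is still 7.3.1 after the rebuild, the environment is most likely picking up an older library than the one intended.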
Glad that worked, but let us know if you continue to have trouble resolving the performance issue…