Illegal memory access with latest PyTorch/Apex

Hi,

I’m trying to train a model with Apex using the latest PyTorch and Apex master branches, but every forward call ends with the following error:

  File "main.py", line 348, in <module>
    metrics = train(batch, args.fp16)
  File "main.py", line 88, in train
    scaled_loss.backward()
  File "/nfs/project/mr/miniconda/envs/machine_reading/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/nfs/project/mr/miniconda/envs/machine_reading/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The full log, captured with the CUDNN_LOGINFO_DBG flag set, is attached below. The environment uses cuDNN 7.3.1 and CUDA 9.2.

100936.txt
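
For anyone wanting to capture a similar log, here is a minimal sketch of one way to switch on cuDNN API logging before launching a training script. It assumes the cuDNN 7.x logging variables `CUDNN_LOGINFO_DBG` and `CUDNN_LOGDEST_DBG`, and adds `CUDA_LAUNCH_BLOCKING` so that the traceback is more likely to point at the failing call. The helper name `cudnn_debug_env` and the default log file name are hypothetical and not part of the original report.

```python
import os


def cudnn_debug_env(log_dest="cudnn_api.log", base_env=None):
    """Return a copy of the environment with cuDNN API logging switched on.

    Hypothetical helper: the name and the default log file are illustrative only.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDNN_LOGINFO_DBG"] = "1"       # enable cuDNN informational API logging
    env["CUDNN_LOGDEST_DBG"] = log_dest  # "stdout", "stderr", or a file path
    env["CUDA_LAUNCH_BLOCKING"] = "1"    # run kernels synchronously for clearer tracebacks
    return env


# Possible usage (assumes the training script is called main.py, as in the traceback):
#   import subprocess, sys
#   subprocess.run([sys.executable, "main.py"], env=cudnn_debug_env())
```

The variables need to be set before the process loads cuDNN, which is why the sketch builds an environment for a child process instead of changing settings after `torch` has been imported.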

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
mfuntowicz commented, Nov 20, 2018

Side note: recompiling PyTorch after upgrading to cuDNN 7.4.1 on the desktop with the 1080 Ti fixed the issue. I’m rebuilding a PyTorch environment linked against 7.4.1 to validate on V100 cards.

I’ll let you know ASAP. Morgan
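
Since the fix came from linking against a newer cuDNN, it can help to confirm which version a given PyTorch build actually loads. The sketch below is an illustration, not something from the thread: `decode_cudnn_version` and `report_cudnn` are hypothetical helper names, and the decoding assumes the cuDNN 7.x/8.x integer encoding (major * 1000 + minor * 100 + patch), which newer cuDNN major versions do not follow.

```python
def decode_cudnn_version(code):
    """Split a cuDNN 7.x/8.x style integer (e.g. 7401) into (major, minor, patch).

    Assumes the encoding major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def report_cudnn():
    """Return the cuDNN version PyTorch is using, or None if it cannot be determined."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment
    code = torch.backends.cudnn.version()  # None when cuDNN is unavailable
    return None if code is None else decode_cudnn_version(code)


if __name__ == "__main__":
    print("cuDNN version seen by PyTorch:", report_cudnn())
```

Under that assumption, a build linked against cuDNN 7.3.1 would report `(7, 3, 1)` and a rebuilt environment would report `(7, 4, 1)`.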

0 reactions
mcarilli commented, Nov 21, 2018

Glad that worked, but let us know if you continue to have trouble resolving the performance issue…

Top Results From Across the Web

distributed - PyTorch Forums
I met a strange illegal memory access error during evaluation step. It happens randomly. I don't think there is anything wrong in my...

CUDA_ERROR_ILLEGAL_ADD...
Apex | CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered ... The last cell is where I run the train.py function.

CUDA memory error when trying to train model - vision
I'm trying to train a pytorch model pix2pix. ... CUDA error: an illegal memory access was encountered Traceback (most recent call last): ...

CUDA error: an illegal memory access was encountered with ...
Try to use the latest PyTorch (1.10). The error indicates an out of bound memory access similar to a segfault on the CPU,...

CUDA error: an illegal memory access was encountered
Error we get: We get an illegal memory access error during the forward pass of a linear layer (relatve_pose_regressor module).
