
Illegal memory access with latest PyTorch/Apex


Hi,

I’m trying to train a model with Apex using the latest PyTorch and Apex master branches, but every forward call ends with the following error:

  File "main.py", line 348, in <module>
    metrics = train(batch, args.fp16)
  File "main.py", line 88, in train
    scaled_loss.backward()
  File "/nfs/project/mr/miniconda/envs/machine_reading/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/nfs/project/mr/miniconda/envs/machine_reading/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The full log, captured with the CUDNN_LOGINFO_DBG flag set, is attached below. The setup uses cuDNN 7.3.1 / CUDA 9.2.

100936.txt
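For readers who want to capture a similar log, here is a minimal sketch of one way to switch on cuDNN API logging from Python. It relies on the cuDNN environment variables CUDNN_LOGINFO_DBG and CUDNN_LOGDEST_DBG. The helper names and the `cudnn_debug.log` destination are hypothetical and not taken from the issue; only the `main.py` script name comes from the traceback above.

```python
import os
import subprocess
import sys


def cudnn_logging_env(log_dest="cudnn_debug.log", base_env=None):
    """Return a copy of the environment with cuDNN API logging switched on.

    log_dest is an assumed destination: "stdout", "stderr", or a file path.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDNN_LOGINFO_DBG"] = "1"       # enable informational API logging
    env["CUDNN_LOGDEST_DBG"] = log_dest  # where cuDNN writes the log
    return env


def run_with_cudnn_logging(script="main.py", *script_args):
    """Run a training script in a child process with cuDNN logging enabled."""
    cmd = [sys.executable, script, *script_args]
    # The variables must be set before the child process loads cuDNN.
    return subprocess.run(cmd, env=cudnn_logging_env(), check=False)
```

Setting the variables for a child process, rather than inside the training script, avoids any dependence on when the library is first loaded.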

Issue Analytics

State:
Created 5 years ago
Comments:9 (4 by maintainers)

Top GitHub Comments


mfuntowicz commented, Nov 20, 2018

Side note: recompiling PyTorch after upgrading to cuDNN 7.4.1 on the desktop with the 1080Ti fixed the issue. I’m rebuilding a PyTorch environment linked against 7.4.1 to validate on V100 cards.

I’ll let you know ASAP. Morgan
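Since the fix depended on which cuDNN release PyTorch was built against, it can help to confirm the linked version after a rebuild. The sketch below uses `torch.backends.cudnn.version()`, which reports the version as a single integer. The decoding helper is hypothetical and assumes the cuDNN 7.x encoding (major * 1000 + minor * 100 + patch); other major versions may encode it differently.

```python
def decode_cudnn_version(version):
    """Split a cuDNN 7.x style integer into (major, minor, patch).

    Assumes the encoding major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def linked_cudnn_version():
    """Return the cuDNN version PyTorch reports, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    raw = torch.backends.cudnn.version()  # None when cuDNN is not available
    return None if raw is None else decode_cudnn_version(raw)
```

Under that assumption, an environment matching the one described in this comment would report 7401, which decodes to (7, 4, 1), while the failing setup from the original report would decode to (7, 3, 1).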


mcarilli commented, Nov 21, 2018

Glad that worked, but let us know if you continue to have trouble resolving the performance issue…

