RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101)
```
File "../ptx/fit_extension.py", line 386, in _train_epoch
    scaled_loss.backward()
File "/home/suiguobin/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
File "../../apex/apex/amp/handle.py", line 125, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
File "../../apex/apex/amp/_process_optimizer.py", line 123, in post_backward_with_master_weights
    models_are_masters=False)
File "../../apex/apex/amp/scaler.py", line 113, in unscale
    1./scale)
File "../../apex/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
    *args)
RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f17e2ce2021 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f17e2ce18ea in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: void multi_tensor_apply<2, ScaleFunctor<c10::Half, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<c10::Half, float>, float) + 0x1805 (0x7f17db4c3a75 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x15a8 (0x7f17db4b8748 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1784f (0x7f17db4b684f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x14e4f (0x7f17db4b3e4f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #54: __libc_start_main + 0xf5 (0x7f1824cc3b45 in /lib/x86_64-linux-gnu/libc.so.6)
```
I use a single card to run amp and it produces the above error. However, when I train with more than one card, it doesn't produce any error.
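For reference, a minimal sketch of the kind of training loop that hits this; the tiny `nn.Linear` model and random data are just placeholders for the real code, while the `amp.initialize` / `amp.scale_loss` pattern follows the apex documentation:

```python
import torch
import torch.nn as nn
from apex import amp

device = torch.device("cuda")                     # default device, i.e. cuda:0
model = nn.Linear(128, 10).to(device)             # placeholder for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Standard apex mixed-precision setup; O1 is the usual starting point.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for _ in range(10):
    inputs = torch.randn(32, 128, device=device)           # placeholder data
    targets = torch.randint(0, 10, (32,), device=device)
    loss = criterion(model(inputs), targets)
    optimizer.zero_grad()
    # The traceback above is raised inside this backward call, when amp
    # unscales the gradients through the fused multi_tensor_apply kernel.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```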
I also encountered a similar error. I specified the default GPU for each process with torch.cuda.set_device(), and I was able to avoid this error.
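Roughly what I did, assuming one process per GPU launched with `torch.distributed.launch` (the argument handling and placeholder names are just illustrative):

```python
import argparse
import torch

# torch.distributed.launch passes --local_rank to each worker process.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# Pin the default CUDA device for this process *before* building the model
# and calling amp.initialize(), so apex's fused kernels launch on the same
# device that holds the parameters and gradients.
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)

# ... build the model/optimizer on `device`, then amp.initialize(...) as usual ...
```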
Yep, same problem.
device = torch.device('cuda:0') works OK
device = torch.device('cuda:1') fails when calling scaled_loss.backward()
Fixed by a call to torch.cuda.set_device(torch.device('cuda:1'))
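A minimal sketch of that workaround, assuming the rest of the training code stays unchanged:

```python
import torch

device = torch.device("cuda:1")

# Without this call the tensors live on cuda:1 while the current CUDA device
# is still cuda:0, and scaled_loss.backward() dies with the illegal memory access.
torch.cuda.set_device(device)

# ... move model and data to `device`, call amp.initialize(), and train as usual ...
```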
I'm guessing that somewhere in your code there are two references being kept to different devices.
Can also be fixed by running with opt_level O0, so I guess that means the problem is likely not in my code.
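For completeness, this is all the O0 workaround amounts to (the `nn.Linear` model is a placeholder; O0 is pure FP32 training, so the FP16 unscaling path from the traceback is presumably never exercised, at the cost of losing the mixed-precision speedup):

```python
import torch
import torch.nn as nn
from apex import amp

device = torch.device("cuda:1")
model = nn.Linear(128, 10).to(device)     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# O0 keeps everything in FP32, which sidesteps the crash here but also
# gives up the benefit of mixed precision.
model, optimizer = amp.initialize(model, optimizer, opt_level="O0")
```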