An illegal memory access was encountered

🐛 Bug

I ran the script below on 4 x P100 GPUs.

PYTHON="/root/miniconda3/bin/python"
CONFIG="./configs/e2e_mask_rcnn_R_50_FPN_1x.yaml"

export NGPUS=4
${PYTHON} -m torch.distributed.launch --nproc_per_node=$NGPUS \
	./tools/train_net.py --config-file $CONFIG

Expected behavior

Here is the error information (attached as an image in the original issue).

The first few iterations seem fine (iter: 0, 20).

Then, at iter 40, the number in the brackets becomes nan, and I get an error saying that an illegal memory access was encountered.
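
Since the loss turns into nan before the CUDA error shows up, one way to narrow this down is to stop at the first non-finite loss instead of waiting for the crash. Below is a minimal sketch, assuming the training loop exposes its loss terms as a dict mapping a name to a scalar value. `find_non_finite` and `check_losses` are hypothetical helper names for illustration, not functions from the repository.

```python
import math


def find_non_finite(losses):
    """Return the names of loss terms whose value is NaN or infinite.

    Assumption: `losses` maps a loss name to a scalar that float() accepts
    (a Python float, or a single-element tensor).
    """
    return [
        name for name, value in losses.items()
        if not math.isfinite(float(value))
    ]


def check_losses(iteration, losses):
    """Raise as soon as any loss term stops being finite."""
    bad = find_non_finite(losses)
    if bad:
        raise FloatingPointError(
            "non-finite loss at iteration {}: {}".format(
                iteration, ", ".join(sorted(bad))
            )
        )
```

Calling `check_losses(iteration, loss_dict)` once per iteration would make the run fail at the step where the loss first diverges, which is probably easier to inspect than the later illegal memory access.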

Environment

I installed the environment by following the instructions.

  • PyTorch Version 1.0
  • Linux 16.04
  • Python version: 3.6
  • CUDA/cuDNN version: 8.0
  • GPU models and configuration: 4 X P100

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
PkuRainBow commented, Oct 29, 2018

@fmassa Thanks for your kind help.

I will post an update if I make progress.

0 reactions
zimenglan-sysu-512 commented, Dec 4, 2018

Hi @fmassa, I want to add Light-Head R-CNN to train R-50-C4 on the COCO dataset. There may be something wrong in my implementation, so I need to check my code. Thanks.

Top Results From Across the Web

an illegal memory access was encountered
A very very strange place is that when I use a smaller matrix (120×32400) to do a test, there is no error occured...

an illegal memory access was encountered
Your GPU function makeGrey takes it's arguments by reference, those values live on the stack, not in GPU-memory, take them by value instead....

RuntimeError: CUDA error: an illegal memory access was ...
This is not a memory problem, as I am using only about 3-6 GB of VRAM, out of 12 GB on my RTX...

Getting lots of "CUDA: an illegal memory access was ...
Getting lots of "CUDA: an illegal memory access was encountered" while benchmarking most algorithms. I've been mining with my two 1070s for a ......

Cuda illegal memory access(kokkos) multiple MPI per GPU
I have encountered cuda illegal memory access(lib kokkos) when using multiple MPI per GPU. With KOKKOS, you should have only one MPI rank...
