An illegal memory access was encountered
🐛 Bug
I ran the script below on 4 x P100 GPUs.
PYTHON="/root/miniconda3/bin/python"
CONFIG="./configs/e2e_mask_rcnn_R_50_FPN_1x.yaml"
export NGPUS=4
${PYTHON} -m torch.distributed.launch --nproc_per_node=$NGPUS \
./tools/train_net.py --config-file $CONFIG
Expected behavior
Here is the error information:
The first few logged iterations look fine (iter: 0, 20).
Then at iter 40, the number in the brackets becomes nan, and after that I get an error saying that an illegal memory access was encountered.
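Since the nan shows up in the log before the crash, one way to narrow this down might be to stop training as soon as any loss becomes non-finite, rather than waiting for the later CUDA error. Below is a minimal sketch of such a check. It assumes the per-iteration losses are available as plain Python floats (for a PyTorch tensor that would be `tensor.item()`). The helper names and the loss names in the example are hypothetical and are not part of the training script above.

```python
import math


def find_non_finite_losses(loss_values):
    """Return the names of losses whose value is NaN or +/-inf.

    `loss_values` is a hypothetical dict mapping a loss name to a plain
    Python float (for a PyTorch tensor, that would be `tensor.item()`).
    """
    return [name for name, value in loss_values.items()
            if not math.isfinite(value)]


def assert_losses_finite(iteration, loss_values):
    """Raise early, with a readable message, instead of continuing to train."""
    bad = find_non_finite_losses(loss_values)
    if bad:
        raise FloatingPointError(
            "non-finite loss at iter {}: {}".format(iteration, ", ".join(sorted(bad)))
        )


# Illustrative values only; these are not taken from the original log.
assert_losses_finite(20, {"loss_classifier": 0.71, "loss_mask": 0.69})
```

Calling something like `assert_losses_finite` once per iteration, before the backward pass, would point at the first loss term that diverges. It does not by itself explain the illegal memory access.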
Environment
I installed the environment by following the instructions.
- PyTorch Version 1.0
- Linux 16.04
- Python version: 3.6
- CUDA/cuDNN version: 8.0
- GPU models and configuration: 4 X P100
Issue Analytics
- State:
- Created 5 years ago
- Comments:10 (7 by maintainers)
Top GitHub Comments
@fmassa Thanks for your kind help.
I will post an update if I make progress.
Hi @fmassa, I want to add light-head rcnn to train R-50-C4 on the COCO dataset. Maybe something is wrong in my implementation; I need to check my code. Thanks.