
using c10d: illegal memory access

See original GitHub issue

I am working on a translation task and I am trying to overlap communication with the backward pass.

I upgraded PyTorch by building from source. When I try c10d (the default) with fp16, I get the following error. Is there a specific PyTorch version I need to use for c10d? Previously, I was able to run multi-GPU fp16 training with PyTorch 0.4.1 (not using c10d).

Running on 8 GPUs:

  • NCCL: 2.3.5+cuda9.2
  • CUDA: 9.2
  • PyTorch: 1.0.0a0+d4f9dbf
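
For reference, the failure happens inside the c10d DistributedDataParallel reduction hook during the backward pass. A minimal sketch of the kind of setup involved looks roughly like this; the model and tensor shapes are placeholders, and the address and world size are copied from the run script further down, not from fairseq internals:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", "0"))   # one process per GPU (assumed launcher convention)
torch.cuda.set_device(local_rank)

dist.init_process_group(backend="nccl",                         # c10d NCCL backend
                        init_method="tcp://10.0.0.168:13333",   # address taken from the run script below
                        world_size=8,
                        rank=local_rank)

model = torch.nn.Linear(1024, 1024).cuda().half()   # stand-in for the fp16 transformer
model = DDP(model, device_ids=[local_rank])         # gradients are all-reduced bucket by bucket during backward

x = torch.randn(16, 1024, device="cuda").half()
model(x).sum().backward()                           # the reported error is raised from this backward pass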

| epoch 001: 0%| | 7/5472 [00:01<20:41, 4.40it/s, loss=15.849, nll_loss=15.862, ppl=59571.10, wps=4292, ups=0.2, wpb=25277, bsz=756, num_updates=6, lr=1.59985e-06, gnorm=5.906, clip=0%, oom=0, loss_scale=64.000, wall=35, train_wall=2]
THCudaCheck FAIL file=/home/ubuntu/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=271 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 352, in <module>
    distributed_main(args)
  File "/home/ubuntu/fairseq/distributed_train.py", line 39, in main
    single_process_main(args)
  File "/home/ubuntu/fairseq/train.py", line 90, in main
    train(args, trainer, task, epoch_itr)
  File "/home/ubuntu/fairseq/train.py", line 125, in train
    log_output = trainer.train_step(samples)
  File "/home/ubuntu/fairseq/fairseq/trainer.py", line 194, in train_step
    raise e
  File "/home/ubuntu/fairseq/fairseq/trainer.py", line 176, in train_step
    ignore_grad
  File "/home/ubuntu/fairseq/fairseq/tasks/fairseq_task.py", line 174, in train_step
    optimizer.backward(loss)
  File "/home/ubuntu/fairseq/fairseq/optim/fp16_optimizer.py", line 102, in backward
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 376, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 405, in _queue_reduction
    self.device_ids)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /home/ubuntu/pytorch/aten/src/THC/THCCachingHostAllocator.cpp:271
ip-10-0-0-168:9882:9882 [5] init.cu:117 NCCL WARN Cuda failure 'an illegal memory access was encountered'
ip-10-0-0-168:9882:9882 [5] NCCL INFO init.cu:772 -> 1
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /home/ubuntu/pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
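
Since CUDA reports errors asynchronously, the stack trace above does not necessarily point at the kernel that actually faulted. A general way to localize this kind of failure (a debugging step suggested here, not taken from the thread) is to force synchronous kernel launches before CUDA is initialized:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # must be set before the first CUDA call

import torch                               # import after setting the variable
# With blocking launches, an illegal memory access surfaces at the Python call
# that issued the faulting kernel instead of at a later synchronization point.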

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
cavdard commented, Nov 8, 2018

It is working in fp32; I have not seen an error for 2 hours.

Data: wmt16_en_de_bpe32k

Run script:

#!/bin/bash
# Launch one training process per GPU on this node and wait for all of them.
HOST_PORT="tcp://10.0.0.168:13333"   # rendezvous address for torch.distributed

# Forward termination to all child training processes.
kill_children() {
    for PID in ${PIDS[*]}; do
        kill -TERM "$PID"
    done
}

NODE=0            # index of this node
RANKS_PER_NODE=8  # one rank per GPU

for i in $(seq 0 7); do
    LOCAL_RANK=$i
    DISTRIBUTED_RANK=$((RANKS_PER_NODE * NODE + LOCAL_RANK))
    # Note: NCCL_DEBUG is assigned twice; the second value (INFO) is the one that takes effect.
    NCCL_DEBUG=VERSION NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=^lo,docker0 NCCL_MIN_NRINGS=4 python train.py data-bin/wmt16_en_de_bpe32k   \
              --arch transformer_wmt_en_de_big            \
              --share-all-embeddings                      \
              --optimizer adam                            \
              --adam-betas '(0.9, 0.98)'                  \
              --clip-norm 0.0                             \
              --lr-scheduler inverse_sqrt                 \
              --warmup-init-lr 1e-07                      \
              --warmup-updates 4000                       \
              --lr 0.0010        --fp16                   \
              --min-lr 1e-09                              \
              --dropout 0.3                               \
              --weight-decay 0.0                          \
              --criterion label_smoothed_cross_entropy    \
              --label-smoothing 0.1                       \
              --max-tokens 3584                           \
              --update-freq 1   --max-epoch 35            \
              --distributed-world-size 8                  \
              --distributed-init-method $HOST_PORT        \
              --device-id $LOCAL_RANK                     \
              --distributed-rank $DISTRIBUTED_RANK &
    PIDS[$LOCAL_RANK]=$!   # remember the child PID for cleanup and waiting
done

trap kill_children SIGTERM SIGINT   # forward SIGTERM/SIGINT to the children

# Wait for every worker to finish.
for PID in ${PIDS[*]}; do
    wait "$PID"
done
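
The loop launches one worker per GPU and derives a global rank from the node index; inside each worker, fairseq's distributed flags correspond roughly to the following torch.distributed calls. This is a hedged sketch of the mapping under that assumption, not fairseq's actual code:

import torch
import torch.distributed as dist

LOCAL_RANK = 3                              # illustrative value of $LOCAL_RANK
DISTRIBUTED_RANK = 8 * 0 + LOCAL_RANK       # RANKS_PER_NODE * NODE + LOCAL_RANK

torch.cuda.set_device(LOCAL_RANK)           # what --device-id selects
dist.init_process_group(backend="nccl",
                        init_method="tcp://10.0.0.168:13333",  # --distributed-init-method ($HOST_PORT)
                        world_size=8,                          # --distributed-world-size
                        rank=DISTRIBUTED_RANK)                 # --distributed-rank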
0 reactions
myleott commented, Nov 16, 2018

This is fixed in the latest PyTorch.
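
If you hit this, a quick sanity check (not from the thread itself) is to print the versions of the components involved and confirm you are on a newer build:

import torch

print(torch.__version__)           # PyTorch build
print(torch.version.cuda)          # CUDA toolkit it was built against
print(torch.cuda.nccl.version())   # bundled NCCL version (format differs across releases)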

Read more comments on GitHub >

Top Results From Across the Web

RuntimeError: CUDA error: an illegal memory access was ...
Hi, everyone! I met a strange illegal memory access error. It happens randomly without any regular pattern. The code is really simple.
Read more >
PyTorch CUDA error: an illegal memory access was ...
It was partially said by the answer of the OP, but the problem under the hood with illegal memory access is that the...
Read more >
CUDA error: an illegal memory access was encountered
Hi, all. I am getting a weird illegal memory access error whenever I try to train a FasterRCNN model with an image size...
Read more >
an illegal memory access was encountered cuda kernel errors ...
RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace ......
Read more >
The "CUDA error: an illegal memory access was encountered" problem
Resolving RuntimeError: CUDA error: an illegal memory access was ... frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f7b6453f9cd in ...
Read more >
