using c10d: illegal memory access
I am working on a translation task and I am trying to overlap communication with the backward pass.
I upgraded PyTorch by building from source. When I try c10d (the default) with fp16 I get the following error. Is there a specific PyTorch version I need to use for c10d? Previously, I was able to run on multiple GPUs with fp16 on PyTorch 0.4.1 (not using c10d).
Running on 8 GPUs. NCCL 2.3.5+cuda9.2, CUDA 9.2, PyTorch 1.0.0a0+d4f9dbf.
| epoch 001:   0%| | 7/5472 [00:01<20:41, 4.40it/s, loss=15.849, nll_loss=15.862, ppl=59571.10, wps=4292, ups=0.2, wpb=25277, bsz=756, num_updates=6, lr=1.59985e-06, gnorm=5.906, clip=0%, oom=0, loss_scale=64.000, wall=35, train_wall=2]
THCudaCheck FAIL file=/home/ubuntu/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=271 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 352, in <module>
    distributed_main(args)
  File "/home/ubuntu/fairseq/distributed_train.py", line 39, in main
    single_process_main(args)
  File "/home/ubuntu/fairseq/train.py", line 90, in main
    train(args, trainer, task, epoch_itr)
  File "/home/ubuntu/fairseq/train.py", line 125, in train
    log_output = trainer.train_step(samples)
  File "/home/ubuntu/fairseq/fairseq/trainer.py", line 194, in train_step
    raise e
  File "/home/ubuntu/fairseq/fairseq/trainer.py", line 176, in train_step
    ignore_grad
  File "/home/ubuntu/fairseq/fairseq/tasks/fairseq_task.py", line 174, in train_step
    optimizer.backward(loss)
  File "/home/ubuntu/fairseq/fairseq/optim/fp16_optimizer.py", line 102, in backward
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 376, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 405, in _queue_reduction
    self.device_ids)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /home/ubuntu/pytorch/aten/src/THC/THCCachingHostAllocator.cpp:271
ip-10-0-0-168:9882:9882 [5] init.cu:117 NCCL WARN Cuda failure 'an illegal memory access was encountered'
ip-10-0-0-168:9882:9882 [5] NCCL INFO init.cu:772 -> 1
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /home/ubuntu/pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
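For context, here is a minimal sketch of the kind of setup being described: DistributedDataParallel over the c10d NCCL backend with fp16 parameters, one process per GPU. The model, tensor sizes, and environment-variable handling are placeholders for illustration, not the reporter's actual fairseq training script, and the launcher is assumed to set the usual RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT variables.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def main():
    # Assumed: RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT are set by the launcher.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # c10d process group over NCCL; DDP all-reduces gradients bucket-by-bucket
    # during backward, overlapping communication with computation.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Placeholder model held in fp16, as in the fp16 training run above.
    model = torch.nn.Linear(1024, 1024).cuda().half()
    ddp_model = DistributedDataParallel(model, device_ids=[torch.cuda.current_device()])

    x = torch.randn(32, 1024, device="cuda", dtype=torch.half)
    loss = ddp_model(x).float().sum()
    loss.backward()  # gradient all-reduce is queued per bucket while backward runs

if __name__ == "__main__":
    main()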
Top GitHub Comments
It is working in fp32; I have not seen an error for 2 hours. Data: wmt16_en_de_bpe32k. Run script:
This is fixed in the latest PyTorch.
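If you hit this, one quick way to confirm you are running a build that includes the fix is to print the PyTorch, CUDA, and NCCL versions at runtime. A small check, assuming a CUDA-enabled build of PyTorch with NCCL support:

import torch

print("PyTorch:", torch.__version__)
print("CUDA:", torch.version.cuda)
print("NCCL:", torch.cuda.nccl.version())  # only available when built with NCCL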