Multi-GPU with PyTorch backend problem
I am using PyTorch 0.4 with 4 GTX 1080Ti GPUs. When I run with the pytorch backend and multiple GPUs, it gives me this error.
```
# asr_train.py --ngpu 4 --backend pytorch --outdir exp/tr_en_vggblstmp_e4_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/results --debugmode 1 --dict data/lang_1char/tr_en_units.txt --debugdir exp/tr_en_vggblstmp_e4_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150 --minibatches 0 --verbose 0 --resume --train-json dump/tr_en/deltafalse/data.json --valid-json dump/dt_en/deltafalse/data.json --etype vggblstmp --elayers 4 --eunits 320 --eprojs 320 --subsample 1_2_2_1_1 --dlayers 1 --dunits 300 --atype location --aconv-chans 10 --aconv-filts 100 --mtlalpha 0.5 --batch-size 30 --maxlen-in 800 --maxlen-out 150 --opt adadelta --epochs 100
Started at Fri Jul 6 10:05:31 CST 2018
2018-07-06 10:05:31,582 (asr_train:146) WARNING: Skip DEBUG/INFO messages
2018-07-06 10:05:31,587 (asr_train:186) WARNING: CUDA_VISIBLE_DEVICES is not set.
2018-07-06 10:05:35,803 (e2e_asr_attctc_th:198) WARNING: Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.
Exception in main training loop: torch/csrc/autograd/variable.cpp:115: get_grad_fn: Assertion `output_nr == 0` failed.
Traceback (most recent call last):
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 122, in update_core
    loss = 1. / self.num_gpu * self.model(x)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/home/lvzhuoran/code/espnet-master/egs/voxforge/asr1/…/…/…/src/bin/asr_train.py", line 224, in <module>
    main()
  File "/home/lvzhuoran/code/espnet-master/egs/voxforge/asr1/…/…/…/src/bin/asr_train.py", line 218, in main
    train(args)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 377, in train
    trainer.run()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 122, in update_core
    loss = 1. / self.num_gpu * self.model(x)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
RuntimeError: torch/csrc/autograd/variable.cpp:115: get_grad_fn: Assertion `output_nr == 0` failed.
```
I am using PyTorch 0.4. I googled the error and found this link to be useful.
Thanks, George.
Top GitHub Comments
DistributedDataParallel will be much better for RNNs. Please use that if possible, even on a single node.
Have a look at our Launch Utility documentation, which cleanly describes how to use DistributedDataParallel: https://pytorch.org/docs/stable/distributed.html#launch-utility
You can write your training script as if it does not get a split input, which also simplifies your code a lot.
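For concreteness, here is a minimal sketch (not the ESPnet code) of a training script written for the launch utility; `build_model()` and `build_loader()` are hypothetical stand-ins for your own model and data pipeline:

```python
# Minimal DistributedDataParallel sketch for use with torch.distributed.launch.
# build_model() and build_loader() are hypothetical placeholders, not ESPnet APIs.
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

# One process per GPU: pin this process to its device and join the process group.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = build_model().cuda(args.local_rank)
model = DistributedDataParallel(model, device_ids=[args.local_rank],
                                output_device=args.local_rank)
optimizer = torch.optim.Adadelta(model.parameters())

for batch in build_loader(rank=args.local_rank):
    optimizer.zero_grad()
    loss = model(batch)   # each process sees its own full (unsplit) mini-batch
    loss.backward()       # gradients are averaged across processes automatically
    optimizer.step()
```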
@bobchennan DistributedDataParallel also works on a single node, and it has been shown to perform much better than DataParallel. Check this and this
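For a single-node run on 4 GPUs, a script like the sketch above would typically be started with the launch utility, e.g. `python -m torch.distributed.launch --nproc_per_node=4 train_ddp.py`, where `train_ddp.py` is a hypothetical script name rather than the existing asr_train.py entry point.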