
Multi GPUs with PyTorch backend problem

See original GitHub issue

I am using PyTorch 0.4 with 4 GTX 1080 Ti GPUs. When I run with the PyTorch backend and multiple GPUs, I get the following error.

```
# asr_train.py --ngpu 4 --backend pytorch --outdir exp/tr_en_vggblstmp_e4_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/results --debugmode 1 --dict data/lang_1char/tr_en_units.txt --debugdir exp/tr_en_vggblstmp_e4_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150 --minibatches 0 --verbose 0 --resume --train-json dump/tr_en/deltafalse/data.json --valid-json dump/dt_en/deltafalse/data.json --etype vggblstmp --elayers 4 --eunits 320 --eprojs 320 --subsample 1_2_2_1_1 --dlayers 1 --dunits 300 --atype location --aconv-chans 10 --aconv-filts 100 --mtlalpha 0.5 --batch-size 30 --maxlen-in 800 --maxlen-out 150 --opt adadelta --epochs 100

Started at Fri Jul 6 10:05:31 CST 2018

2018-07-06 10:05:31,582 (asr_train:146) WARNING: Skip DEBUG/INFO messages
2018-07-06 10:05:31,587 (asr_train:186) WARNING: CUDA_VISIBLE_DEVICES is not set.
2018-07-06 10:05:35,803 (e2e_asr_attctc_th:198) WARNING: Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.
Exception in main training loop: torch/csrc/autograd/variable.cpp:115: get_grad_fn: Assertion output_nr == 0 failed.
Traceback (most recent call last):
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 122, in update_core
    loss = 1. / self.num_gpu * self.model(x)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/home/lvzhuoran/code/espnet-master/egs/voxforge/asr1/…/…/…/src/bin/asr_train.py", line 224, in <module>
    main()
  File "/home/lvzhuoran/code/espnet-master/egs/voxforge/asr1/…/…/…/src/bin/asr_train.py", line 218, in main
    train(args)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 377, in train
    trainer.run()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 122, in update_core
    loss = 1. / self.num_gpu * self.model(x)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
RuntimeError: torch/csrc/autograd/variable.cpp:115: get_grad_fn: Assertion `output_nr == 0` failed.
```
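For context on the code path in the traceback: ESPnet's pytorch backend follows the DataParallel pattern of replicating the model and calling `parallel_apply` on the replicas, and the assertion is re-raised from `torch/nn/parallel/parallel_apply.py`. The sketch below is hypothetical, not ESPnet's actual `asr_pytorch.py` code; it only illustrates the generic replicate/scatter/parallel_apply/gather flow that this kind of backend goes through, using public `torch.nn.parallel` helpers, and it may or may not reproduce the exact assertion on a given PyTorch build.

```python
# Hypothetical sketch of a DataParallel-style multi-GPU forward pass.
# ToyModel stands in for the real E2E ASR model; none of this is ESPnet code.
import torch
import torch.nn as nn
from torch.nn.parallel import replicate, scatter, parallel_apply, gather


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.rnn = nn.LSTM(8, 16, batch_first=True)
        self.out = nn.Linear(16, 4)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)


if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    device_ids = list(range(torch.cuda.device_count()))
    model = ToyModel().cuda()              # the master copy lives on device_ids[0]
    x = torch.randn(30, 50, 8).cuda()      # one full minibatch

    chunks = scatter(x, device_ids)                       # split the batch across GPUs
    replicas = replicate(model, device_ids[:len(chunks)])  # one model copy per chunk
    outputs = parallel_apply(replicas, [(c,) for c in chunks])
    loss = gather(outputs, device_ids[0]).pow(2).mean()    # dummy loss on GPU 0
    loss.backward()                         # gradients flow back to the original model
```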

I am using PyTorch 0.4. I googled the error and found this link to be useful.

Thanks, George.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 21 (14 by maintainers)

Top GitHub Comments

2 reactions
soumith commented, Jul 27, 2018

DistributedDataParallel will be much better for RNNs. Please use it if possible, even on a single node.

Have a look at our Launch Utility documentation that cleanly describes how to use DistributedDataParallel: https://pytorch.org/docs/stable/distributed.html#launch-utility

You can treat your training script as not getting a split input, which also simplifies your code a lot.
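To make the "no split input" point concrete, here is a minimal sketch of the kind of training script the launch utility expects. The names in it (a hypothetical train.py, the toy model, the Adadelta setup) are illustrative rather than ESPnet's real recipe; the thing to notice is that the script only ever sees its own per-process batch, so it never splits a minibatch itself.

```python
# Hypothetical train.py, launched with the utility from the docs linked above, e.g.
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
# The launcher starts one process per GPU and passes --local_rank to each one.
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)


class ToyRNN(nn.Module):
    def __init__(self):
        super(ToyRNN, self).__init__()
        self.rnn = nn.LSTM(8, 16, batch_first=True)
        self.out = nn.Linear(16, 4)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)


model = ToyRNN().cuda(args.local_rank)
model = DistributedDataParallel(model, device_ids=[args.local_rank])

dataset = TensorDataset(torch.randn(120, 50, 8))   # toy data
sampler = DistributedSampler(dataset)               # shards the data per process
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

optimizer = torch.optim.Adadelta(model.parameters())
for (x,) in loader:
    optimizer.zero_grad()
    out = model(x.cuda(args.local_rank, non_blocking=True))
    loss = out.pow(2).mean()                        # dummy loss
    loss.backward()                                 # DDP all-reduces the gradients
    optimizer.step()
```

Compared with the DataParallel-style flow in the issue, each process here owns one GPU and one data shard, so there is no scatter/gather of the batch inside the model's forward.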

1 reaction
miguelvr commented, Jul 13, 2018

@bobchennan DistributedDataParallel works for single nodes and has been shown to have much better performance than DataParallel. Check this and this.
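On a single node you do not even need the launch utility. A hedged sketch (assuming a machine with several GPUs and a PyTorch recent enough to ship `torch.multiprocessing.spawn`; all names here are hypothetical) of starting one DDP process per local GPU:

```python
# Single-node DDP sketch: one process per GPU, no launch utility involved.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def worker(rank, world_size):
    # Rendezvous over localhost; the port is arbitrary but must be free.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DistributedDataParallel(nn.Linear(32, 4).cuda(rank), device_ids=[rank])

    x = torch.randn(16, 32).cuda(rank)   # each process feeds its own shard of the data
    loss = model(x).pow(2).mean()
    loss.backward()                       # gradients are averaged across all processes

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    if world_size > 1:
        mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Because each replica lives in its own process, there is no Python-level contention between GPUs, which is one reason DDP typically outperforms the threaded DataParallel path even on a single machine.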

Read more comments on GitHub.

Top Results From Across the Web

  • pytorch lightning examples doesn't work in multi gpu's with ...
    It looks like hf sets ddp as the backend which is great because dp has a bunch of issues (this is a PyTorch...
  • Multi GPU training with DDP - PyTorch
    In this tutorial, we start with a single-GPU training script and migrate that to running it on 4 GPUs on a single node....
  • Multi-GPU Computing with Pytorch (Draft)
    Pytorch allows 'Gloo', 'MPI' and 'NCCL' as backends for parallelization. In general, Gloo is available on most Linux distros and should be used ......
  • PyTorch 101, Part 4: Memory Management and Using Multiple ...
    This article covers PyTorch's advanced GPU management features, including how to multiple GPU's for your network, whether be it data or model parallelism....
  • Graphics Processing Unit (GPU) — PyTorch Lightning 1.6.2 ...
    This is a limitation of using multiple processes for distributed training within PyTorch. To fix this issue, find your piece of code that...
