
Multi GPUs with PyTorch backend problem

See original GitHub issue

I am using PyTorch 0.4 with 4 GTX 1080 Ti GPUs. When I run with the PyTorch backend and multiple GPUs, I get the following error.

```
# asr_train.py --ngpu 4 --backend pytorch --outdir exp/tr_en_vggblstmp_e4_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150/results --debugmode 1 --dict data/lang_1char/tr_en_units.txt --debugdir exp/tr_en_vggblstmp_e4_subsample1_2_2_1_1_unit320_proj320_d1_unit300_location_aconvc10_aconvf100_mtlalpha0.5_adadelta_bs30_mli800_mlo150 --minibatches 0 --verbose 0 --resume --train-json dump/tr_en/deltafalse/data.json --valid-json dump/dt_en/deltafalse/data.json --etype vggblstmp --elayers 4 --eunits 320 --eprojs 320 --subsample 1_2_2_1_1 --dlayers 1 --dunits 300 --atype location --aconv-chans 10 --aconv-filts 100 --mtlalpha 0.5 --batch-size 30 --maxlen-in 800 --maxlen-out 150 --opt adadelta --epochs 100

Started at Fri Jul 6 10:05:31 CST 2018

2018-07-06 10:05:31,582 (asr_train:146) WARNING: Skip DEBUG/INFO messages
2018-07-06 10:05:31,587 (asr_train:186) WARNING: CUDA_VISIBLE_DEVICES is not set.
2018-07-06 10:05:35,803 (e2e_asr_attctc_th:198) WARNING: Subsampling is not performed for vgg*. It is performed in max pooling layers at CNN.
Exception in main training loop: torch/csrc/autograd/variable.cpp:115: get_grad_fn: Assertion output_nr == 0 failed.
Traceback (most recent call last):
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 122, in update_core
    loss = 1. / self.num_gpu * self.model(x)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/home/lvzhuoran/code/espnet-master/egs/voxforge/asr1/…/…/…/src/bin/asr_train.py", line 224, in <module>
    main()
  File "/home/lvzhuoran/code/espnet-master/egs/voxforge/asr1/…/…/…/src/bin/asr_train.py", line 218, in main
    train(args)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 377, in train
    trainer.run()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 122, in update_core
    loss = 1. / self.num_gpu * self.model(x)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lvzhuoran/code/espnet-master/src/asr/asr_pytorch.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lvzhuoran/code/espnet-master/tools/venv/local/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
RuntimeError: torch/csrc/autograd/variable.cpp:115: get_grad_fn: Assertion `output_nr == 0` failed.
```
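For context on the code path in the traceback: ESPnet's pytorch backend follows the DataParallel pattern of replicating the model and calling `parallel_apply` on the replicas, and the assertion is re-raised from `torch/nn/parallel/parallel_apply.py`. The sketch below is hypothetical, not ESPnet's actual `asr_pytorch.py` code; it only illustrates the generic replicate/scatter/parallel_apply/gather flow that this kind of backend goes through, using public `torch.nn.parallel` helpers, and it may or may not reproduce the exact assertion on a given PyTorch build.

```python
# Hypothetical sketch of a DataParallel-style multi-GPU forward pass.
# ToyModel stands in for the real E2E ASR model; none of this is ESPnet code.
import torch
import torch.nn as nn
from torch.nn.parallel import replicate, scatter, parallel_apply, gather


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.rnn = nn.LSTM(8, 16, batch_first=True)
        self.out = nn.Linear(16, 4)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)


if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    device_ids = list(range(torch.cuda.device_count()))
    model = ToyModel().cuda()              # the master copy lives on device_ids[0]
    x = torch.randn(30, 50, 8).cuda()      # one full minibatch

    chunks = scatter(x, device_ids)                       # split the batch across GPUs
    replicas = replicate(model, device_ids[:len(chunks)])  # one model copy per chunk
    outputs = parallel_apply(replicas, [(c,) for c in chunks])
    loss = gather(outputs, device_ids[0]).pow(2).mean()    # dummy loss on GPU 0
    loss.backward()                         # gradients flow back to the original model
```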

I am using PyTorch 0.4. I googled the error and found this link to be useful.

Thanks, George.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 21 (14 by maintainers)

Top GitHub Comments

2 reactions
soumith commented, Jul 27, 2018

DistributedDataParallel will be much better for RNNs. Please use it if possible, even on a single node.

Have a look at our Launch Utility documentation that cleanly describes how to use DistributedDataParallel: https://pytorch.org/docs/stable/distributed.html#launch-utility

You can treat your training script as not getting a split input, which also simplifies your code a lot.
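To make the "no split input" point concrete, here is a minimal sketch of the kind of training script the launch utility expects. The names in it (a hypothetical train.py, the toy model, the Adadelta setup) are illustrative rather than ESPnet's real recipe; the thing to notice is that the script only ever sees its own per-process batch, so it never splits a minibatch itself.

```python
# Hypothetical train.py, launched with the utility from the docs linked above, e.g.
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
# The launcher starts one process per GPU and passes --local_rank to each one.
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)


class ToyRNN(nn.Module):
    def __init__(self):
        super(ToyRNN, self).__init__()
        self.rnn = nn.LSTM(8, 16, batch_first=True)
        self.out = nn.Linear(16, 4)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)


model = ToyRNN().cuda(args.local_rank)
model = DistributedDataParallel(model, device_ids=[args.local_rank])

dataset = TensorDataset(torch.randn(120, 50, 8))   # toy data
sampler = DistributedSampler(dataset)               # shards the data per process
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

optimizer = torch.optim.Adadelta(model.parameters())
for (x,) in loader:
    optimizer.zero_grad()
    out = model(x.cuda(args.local_rank, non_blocking=True))
    loss = out.pow(2).mean()                        # dummy loss
    loss.backward()                                 # DDP all-reduces the gradients
    optimizer.step()
```

Compared with the DataParallel-style flow in the issue, each process here owns one GPU and one data shard, so there is no scatter/gather of the batch inside the model's forward.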

1 reaction
miguelvr commented, Jul 13, 2018

@bobchennan DistributedDataParallel works for single nodes and has been shown to have much better performance than DataParallel. Check this and this.
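On a single node you do not even need the launch utility. A hedged sketch (assuming a machine with several GPUs and a PyTorch recent enough to ship `torch.multiprocessing.spawn`; all names here are hypothetical) of starting one DDP process per local GPU:

```python
# Single-node DDP sketch: one process per GPU, no launch utility involved.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def worker(rank, world_size):
    # Rendezvous over localhost; the port is arbitrary but must be free.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DistributedDataParallel(nn.Linear(32, 4).cuda(rank), device_ids=[rank])

    x = torch.randn(16, 32).cuda(rank)   # each process feeds its own shard of the data
    loss = model(x).pow(2).mean()
    loss.backward()                       # gradients are averaged across all processes

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    if world_size > 1:
        mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Because each replica lives in its own process, there is no Python-level contention between GPUs, which is one reason DDP typically outperforms the threaded DataParallel path even on a single machine.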

Read more comments on GitHub.

Top Results From Across the Web

  • pytorch lightning examples doesn't work in multi gpu's with ...
    It looks like hf sets ddp as the backend which is great because dp has a bunch of issues (this is a PyTorch...
  • Multi GPU training with DDP - PyTorch
    In this tutorial, we start with a single-GPU training script and migrate that to running it on 4 GPUs on a single node....
  • Multi-GPU Computing with Pytorch (Draft)
    Pytorch allows 'Gloo', 'MPI' and 'NCCL' as backends for parallelization. In general, Gloo is available on most Linux distros and should be used ......
  • PyTorch 101, Part 4: Memory Management and Using Multiple ...
    This article covers PyTorch's advanced GPU management features, including how to multiple GPU's for your network, whether be it data or model parallelism....
  • Graphics Processing Unit (GPU) — PyTorch Lightning 1.6.2 ...
    This is a limitation of using multiple processes for distributed training within PyTorch. To fix this issue, find your piece of code that...
