MultiGPU error on WarpCTC
I prepared a new Docker container with the updated tools, and when I tried to run multi-GPU training, I got the error below. I am posting it here to find out whether anyone else has hit a similar error with the new warp-ctc.
2018-07-26 13:51:18,146 (e2e_asr_attctc_th:99) INFO: mtl loss:417.586242676
Exception in main training loop: arguments are located on different GPUs at /pytorch/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:313
Traceback (most recent call last):
File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/espnet/src/asr/asr_pytorch.py", line 131, in update_core
loss.backward(torch.ones(self.num_gpu)) # Backprop
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/__init__.py", line 89, in backward
allow_unreachable=True) # allow_unreachable flag
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/function.py", line 76, in apply
return self._forward_cls.backward(self, *args)
File "build/bdist.linux-x86_64/egg/warpctc_pytorch/__init__.py", line 50, in backward
return ctx.grads * grad_output.type_as(ctx.grads), None, None, None, None, None
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/espnet/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 231, in <module>
main()
File "/espnet/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 225, in main
train(args)
File "/espnet/src/asr/asr_pytorch.py", line 383, in train
trainer.run()
File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/espnet/src/asr/asr_pytorch.py", line 131, in update_core
loss.backward(torch.ones(self.num_gpu)) # Backprop
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/__init__.py", line 89, in backward
allow_unreachable=True) # allow_unreachable flag
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/function.py", line 76, in apply
return self._forward_cls.backward(self, *args)
File "build/bdist.linux-x86_64/egg/warpctc_pytorch/__init__.py", line 50, in backward
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:313
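For context, the failing frame is warpctc_pytorch's backward, which returns ctx.grads * grad_output.type_as(ctx.grads). type_as() only matches the dtype, not the device index, so when grad_output arrives on a different GPU than ctx.grads, the elementwise multiply raises the error above. A minimal sketch of the mismatch and the usual device fix (my illustration, not ESPnet or warp-ctc code; assumes at least two visible GPUs):

import torch

grads = torch.ones(4).cuda(0)        # stands in for ctx.grads, pinned to GPU 0
grad_output = torch.ones(4).cuda(1)  # incoming gradient arriving on GPU 1

# grads * grad_output.type_as(grads)  # raises "arguments are located on
#                                     # different GPUs": the cast changes
#                                     # the dtype only, not the device

# moving the gradient onto grads' device first avoids the error:
out = grads * grad_output.cuda(grads.get_device()).type_as(grads)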
Top GitHub Comments
@dneelagiri and @alirezadir: due to lack of time, I could not test any solution directly, since I use multi-GPU with Chainer, but the solution given by @weiwchu seems to be working. So it would be better if you rebuild the container after modifying the required lines in /tools/Makefile.
@Fhrozen I ran into the same issue as you. I dug into it, and it seems there were some Python binding issues with warp-ctc. I did not have time to fix those, so I rolled back to pytorch==0.3.1 instead of the latest PyTorch 0.4, and also rolled back to Sean Naren's warp-CTC:
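(The commands were not preserved in this capture; the following is a sketch of the standard SeanNaren/warp-ctc build against PyTorch 0.3.1, where the version pin and paths are my assumptions, not quoted from the comment.)

# roll back PyTorch; 0.3.1 is the version named above (pin is an assumption)
pip install torch==0.3.1

# build Sean Naren's warp-ctc and its PyTorch binding
git clone https://github.com/SeanNaren/warp-ctc.git
cd warp-ctc
mkdir build && cd build
cmake ..
make
cd ../pytorch_binding
python setup.py install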
Then I was good to go. Hope that also works for you.