MultiGPU error on WarpCTC
I prepared a new Docker container with the updated tools, and when I tried to run multi-GPU training, I got the error below. I am posting it here to find out whether anyone else has hit a similar error with the new warp-ctc.
2018-07-26 13:51:18,146 (e2e_asr_attctc_th:99) INFO: mtl loss:417.586242676
Exception in main training loop: arguments are located on different GPUs at /pytorch/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:313
Traceback (most recent call last):
File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/espnet/src/asr/asr_pytorch.py", line 131, in update_core
loss.backward(torch.ones(self.num_gpu)) # Backprop
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/__init__.py", line 89, in backward
allow_unreachable=True) # allow_unreachable flag
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/function.py", line 76, in apply
return self._forward_cls.backward(self, *args)
File "build/bdist.linux-x86_64/egg/warpctc_pytorch/__init__.py", line 50, in backward
return ctx.grads * grad_output.type_as(ctx.grads), None, None, None, None, None
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/espnet/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 231, in <module>
main()
File "/espnet/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 225, in main
train(args)
File "/espnet/src/asr/asr_pytorch.py", line 383, in train
trainer.run()
File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/espnet/src/asr/asr_pytorch.py", line 131, in update_core
loss.backward(torch.ones(self.num_gpu)) # Backprop
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/__init__.py", line 89, in backward
allow_unreachable=True) # allow_unreachable flag
File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/function.py", line 76, in apply
return self._forward_cls.backward(self, *args)
File "build/bdist.linux-x86_64/egg/warpctc_pytorch/__init__.py", line 50, in backward
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:313
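For context, the failing frame is warpctc_pytorch's backward, which returns ctx.grads * grad_output.type_as(ctx.grads). type_as() only matches the dtype, not the device index, so when grad_output arrives on a different GPU than ctx.grads, the elementwise multiply raises the error above. A minimal sketch of the mismatch and the usual device fix (my illustration, not ESPnet or warp-ctc code; assumes at least two visible GPUs):

import torch

grads = torch.ones(4).cuda(0)        # stands in for ctx.grads, pinned to GPU 0
grad_output = torch.ones(4).cuda(1)  # incoming gradient arriving on GPU 1

# grads * grad_output.type_as(grads)  # raises "arguments are located on
#                                     # different GPUs": the cast changes
#                                     # the dtype only, not the device

# moving the gradient onto grads' device first avoids the error:
out = grads * grad_output.cuda(grads.get_device()).type_as(grads)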
Top GitHub Comments
@dneelagiri and @alirezadir: due to lack of time, I could not test any solution directly, since I use multi-GPU with Chainer, but the solution given by @weiwchu seems to be working. So it would be better if you rebuild the container after modifying the required lines in /tools/Makefile.
@Fhrozen I ran into the same issue as you. I dug into it, and it seems there were some Python binding issues with warp-ctc. I did not have time to fix those, so I rolled back to pytorch==0.3.1 instead of the latest PyTorch 0.4, and also rolled back to Sean Naren's warp-CTC:
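(The commands were not preserved in this capture; the following is a sketch of the standard SeanNaren/warp-ctc build against PyTorch 0.3.1, where the version pin and paths are my assumptions, not quoted from the comment.)

# roll back PyTorch; 0.3.1 is the version named above (pin is an assumption)
pip install torch==0.3.1

# build Sean Naren's warp-ctc and its PyTorch binding
git clone https://github.com/SeanNaren/warp-ctc.git
cd warp-ctc
mkdir build && cd build
cmake ..
make
cd ../pytorch_binding
python setup.py install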
Then I was good to go. Hope that also works for you.