MultiGPU error on WarpCTC

I prepared a new Docker container with the updated tools, and when I tried to run with multi-GPU training I got the error below. I am posting it here to find out whether anyone else sees a similar error with the new warp-ctc.

2018-07-26 13:51:18,146 (e2e_asr_attctc_th:99) INFO: mtl loss:417.586242676
Exception in main training loop: arguments are located on different GPUs at /pytorch/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:313
Traceback (most recent call last):
  File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/espnet/src/asr/asr_pytorch.py", line 131, in update_core
    loss.backward(torch.ones(self.num_gpu))  # Backprop
  File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/function.py", line 76, in apply
    return self._forward_cls.backward(self, *args)
  File "build/bdist.linux-x86_64/egg/warpctc_pytorch/__init__.py", line 50, in backward
    return ctx.grads * grad_output.type_as(ctx.grads), None, None, None, None, None
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/espnet/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 231, in <module>
    main()
  File "/espnet/egs/voxforge/asr1/../../../src/bin/asr_train.py", line 225, in main
    train(args)
  File "/espnet/src/asr/asr_pytorch.py", line 383, in train
    trainer.run()
  File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/espnet/tools/venv/local/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/espnet/src/asr/asr_pytorch.py", line 131, in update_core
    loss.backward(torch.ones(self.num_gpu))  # Backprop
  File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/espnet/tools/venv/local/lib/python2.7/site-packages/torch/autograd/function.py", line 76, in apply
    return self._forward_cls.backward(self, *args)
  File "build/bdist.linux-x86_64/egg/warpctc_pytorch/__init__.py", line 50, in backward

RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:313
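
Looking at the last frame, the failure is the elementwise multiply in warpctc_pytorch's backward (return ctx.grads * grad_output.type_as(ctx.grads), ...): type_as() matches the tensor type but not the device index, so when ctx.grads and grad_output sit on different GPUs the product raises exactly this RuntimeError. A minimal sketch of the mismatch, assuming a box with at least two CUDA devices (the exact message wording depends on the PyTorch version):

    import torch

    # Stand-ins for the two operands in warpctc_pytorch's backward():
    grads = torch.ones(4, device="cuda:1")        # plays the role of ctx.grads
    grad_output = torch.ones(4, device="cuda:0")  # incoming gradient from another GPU

    # type_as() only matches the tensor type, so both operands keep their own
    # device and the elementwise multiply fails.
    try:
        grads * grad_output.type_as(grads)
    except RuntimeError as err:
        print(err)  # "arguments are located on different GPUs" on PyTorch 0.4

    # Moving the incoming gradient onto grads' device first avoids the error.
    print((grads * grad_output.to(grads.device)).device)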

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
Fhrozen commented, Aug 17, 2018

@dneelagiri and @alirezadir: due to lack of time, I could not test any solution directly (I use multi-GPU on Chainer), but the solution given by @weiwchu seems to be working. So it would be better to rebuild the container with the required lines in /tools/Makefile modified.

1 reaction
weiwchu commented, Aug 7, 2018

@Fhrozen I hit the same issue as you. Digging into it, it seems there were some Python binding issues with warp-ctc. I didn't have time to fix those, so I rolled back from the latest PyTorch 0.4 to pytorch==0.3.1, and also rolled back to Sean Naren's warp-CTC.

In tools/Makefile, replace

    git clone https://github.com/jnishi/warp-ctc.git

with

    git clone https://github.com/SeanNaren/warp-ctc.git
    . venv/bin/activate; cd warp-ctc && git checkout 9e5b238f8d9337b0c39b3fd01bbaff98ba523aa5

Then I was good to go. Hope that works for you too.
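
One quick way to sanity-check the rolled-back environment before relaunching multi-GPU training (a suggestion of mine, assuming both packages end up in the same venv):

    import torch
    import warpctc_pytorch

    print(torch.__version__)          # expect 0.3.1 after the rollback
    print(torch.cuda.device_count())  # should report every GPU you plan to train on
    print(warpctc_pytorch.__file__)   # shows which warp-ctc binding actually got installed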
