GPU issues with new devices [RTX 2080Ti/V100] inside docker containers
I’ve been having a problem executing training on a server with a V100, using the Chainer backend with mtlalpha > 0; I think the warp-ctc code does not support these devices. I just got a new GPU (RTX 2080 Ti) and it shows the same issue:
Traceback (most recent call last):
File "/espnet/egs/an4/asr1/../../../espnet/bin/asr_train.py", line 274, in <module>
main(sys.argv[1:])
File "/espnet/egs/an4/asr1/../../../espnet/bin/asr_train.py", line 259, in main
train(args)
File "/espnet/espnet/asr/chainer_backend/asr.py", line 470, in train
trainer.run()
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/training/trainer.py", line 329, in run
six.reraise(*sys.exc_info())
File "/espnet/tools/venv/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/espnet/espnet/asr/chainer_backend/asr.py", line 100, in update_core
loss = optimizer.target(*x) / self.accum_grad
File "/espnet/espnet/nets/chainer_backend/e2e_asr.py", line 90, in __call__
loss_ctc = self.ctc(hs, ys)
File "/espnet/espnet/nets/chainer_backend/ctc.py", line 95, in __call__
self.loss = warp_ctc(y_hat, ilens, [cuda.to_cpu(l.data) for l in ys])[0]
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer_ctc/warpctc.py", line 125, in ctc
return CTC(seq_lengths, labels)(x)
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/function.py", line 235, in __call__
ret = node.apply(inputs)
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
outputs = self.forward(in_data)
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/function.py", line 135, in forward
return self._function.forward(inputs)
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer_ctc/warpctc.py", line 77, in forward
raise e
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer_ctc/warpctc.py", line 71, in forward
_ctc()
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer_ctc/warpctc.py", line 68, in _ctc
stream.ptr)
File "chainer_ctc/src/warp_ctc.pyx", line 67, in chainer_ctc.src.warp_ctc.ctc_compute_ctc_loss_gpu
File "chainer_ctc/src/warp_ctc.pyx", line 87, in chainer_ctc.src.warp_ctc.ctc_compute_ctc_loss_gpu
File "chainer_ctc/src/warp_ctc.pyx", line 47, in chainer_ctc.src.warp_ctc.check_status
chainer_ctc.src.warp_ctc.CTCError: CTC_STATUS_EXECUTION_FAILED: b'execution failed'
I did this with the CUDA 10.0 Docker image (on a PC with Ubuntu 18.04 and CUDA 10.1). The training does not raise this error when mtlalpha is set to 0.0 (but the WER is pretty high).
Also, I tried to test ESPnet with the PyTorch backend on the same GPU, but I got the following error:
Traceback (most recent call last):
File "/espnet/egs/an4/asr1/../../../espnet/bin/asr_train.py", line 274, in <module>
main(sys.argv[1:])
File "/espnet/egs/an4/asr1/../../../espnet/bin/asr_train.py", line 262, in main
train(args)
File "/espnet/espnet/asr/pytorch_backend/asr.py", line 267, in train
model = model.to(device)
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
return self._apply(convert)
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 117, in _apply
self.flatten_parameters()
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 113, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
The image is the same one used for the Chainer run, so I suppose there should not be any bad cuDNN setup (this was also tested on another PC before updating the Docker containers on the hub). I will check the containers and upload new ones, but I would like to know whether this may be related to a specific device or not.
Top GitHub Comments
I just finished the research and found a possible solution for this error: for CUDA 10.0 and GTX-series cards, building warp-ctc with `-gencode arch=compute_70,code=sm_70` causes no problem, but for CUDA 10.1 and RTX cards (probably more because of the CUDA version than the device) this is no longer possible, so I followed the recommendation in the code and built with `-gencode arch=compute_60,code=sm_70`, and the Docker container then ran both the Chainer and PyTorch tests successfully. The forks of warp-ctc for Chainer and PyTorch can be found here: https://github.com/Fhrozen/warp-ctc and https://github.com/Fhrozen/chainer_ctc. I would recommend @jnishi modify this in his warp-ctc fork to support CUDA 10.1.
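For context, the RTX 2080 Ti reports compute capability 7.5 and the V100 reports 7.0, so a warp-ctc build whose `-gencode` targets do not cover the running device can fail only at kernel launch time with an execution error like the one above. A minimal standalone check of what the container actually sees (my own diagnostic sketch, not part of warp-ctc or ESPnet) could look like:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the compute capability of every visible GPU so it can be compared
// against the -gencode targets warp-ctc was built with (RTX 2080 Ti -> 7.5,
// V100 -> 7.0). Build with: nvcc check_arch.cu -o check_arch
int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        std::printf("no CUDA device visible\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```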
Another solution could be replacing `__shfl_down` and `__shfl_up` with the `__shfl_sync` family of intrinsics, but it seems to require a large modification in the .cu and .hpp files. Currently this is solved, but it still needs a fix in the ESPnet code 😉
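To illustrate what that second option involves (a generic warp-reduction sketch, not the actual warp-ctc kernel code): the legacy shuffle intrinsics take no mask, whereas the CUDA 9+ `*_sync` variants require an explicit mask of participating lanes, roughly:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Generic warp-reduction sketch (NOT taken from warp-ctc): the legacy
// __shfl_down(val, offset) has no mask argument, while the CUDA 9+
// replacement __shfl_down_sync(mask, val, offset) requires one.
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        // 0xffffffff: all 32 lanes of the warp participate in the shuffle.
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

__global__ void sum_one_warp(const float* in, float* out) {
    float v = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = v;  // lane 0 ends up with the warp-wide sum
}

int main() {
    float h_in[32], h_out = 0.f, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.f;  // expected sum: 32
    cudaMalloc((void**)&d_in, 32 * sizeof(float));
    cudaMalloc((void**)&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    sum_one_warp<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("warp sum = %f\n", h_out);
    return 0;
}
```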
RTX with CUDA 10.1 is now working, with the observation that the PyTorch backend is still not supported. I tested a Docker container with ESPnet 0.4 and PyTorch 1.0.1 and it had no problem, so I suppose this error will be resolved when v0.4.0 is merged.
For the V100 I cannot run the test for now, but the only problem was Chainer, so I suppose the containers running the newer version should also work on the V100. I will try to test once the server with a V100 is free.