GPU issues with new devices [RTX 2080Ti/V100] inside docker containers
I’ve been having a problem executing training on a server with a V100, using the Chainer backend with mtlalpha > 0; I think the warp-ctc code does not support these devices. I just got a new GPU (RTX 2080 Ti) and it shows the same issue:
Traceback (most recent call last):
File "/espnet/egs/an4/asr1/../../../espnet/bin/asr_train.py", line 274, in <module>
main(sys.argv[1:])
File "/espnet/egs/an4/asr1/../../../espnet/bin/asr_train.py", line 259, in main
train(args)
File "/espnet/espnet/asr/chainer_backend/asr.py", line 470, in train
trainer.run()
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/training/trainer.py", line 329, in run
six.reraise(*sys.exc_info())
File "/espnet/tools/venv/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
update()
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
self.update_core()
File "/espnet/espnet/asr/chainer_backend/asr.py", line 100, in update_core
loss = optimizer.target(*x) / self.accum_grad
File "/espnet/espnet/nets/chainer_backend/e2e_asr.py", line 90, in __call__
loss_ctc = self.ctc(hs, ys)
File "/espnet/espnet/nets/chainer_backend/ctc.py", line 95, in __call__
self.loss = warp_ctc(y_hat, ilens, [cuda.to_cpu(l.data) for l in ys])[0]
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer_ctc/warpctc.py", line 125, in ctc
return CTC(seq_lengths, labels)(x)
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/function.py", line 235, in __call__
ret = node.apply(inputs)
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
outputs = self.forward(in_data)
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer/function.py", line 135, in forward
return self._function.forward(inputs)
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer_ctc/warpctc.py", line 77, in forward
raise e
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer_ctc/warpctc.py", line 71, in forward
_ctc()
File "/espnet/tools/venv/lib/python3.7/site-packages/chainer_ctc/warpctc.py", line 68, in _ctc
stream.ptr)
File "chainer_ctc/src/warp_ctc.pyx", line 67, in chainer_ctc.src.warp_ctc.ctc_compute_ctc_loss_gpu
File "chainer_ctc/src/warp_ctc.pyx", line 87, in chainer_ctc.src.warp_ctc.ctc_compute_ctc_loss_gpu
File "chainer_ctc/src/warp_ctc.pyx", line 47, in chainer_ctc.src.warp_ctc.check_status
chainer_ctc.src.warp_ctc.CTCError: CTC_STATUS_EXECUTION_FAILED: b'execution failed'
I did this with the CUDA 10.0 Docker image (on a PC with Ubuntu 18.04 and CUDA 10.1). The training does not raise this error when mtlalpha is set to 0.0 (but the WER is pretty high).
Also, I tried to test ESPnet with the PyTorch backend on the same GPU, but I got the following error:
Traceback (most recent call last):
File "/espnet/egs/an4/asr1/../../../espnet/bin/asr_train.py", line 274, in <module>
main(sys.argv[1:])
File "/espnet/egs/an4/asr1/../../../espnet/bin/asr_train.py", line 262, in main
train(args)
File "/espnet/espnet/asr/pytorch_backend/asr.py", line 267, in train
model = model.to(device)
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
return self._apply(convert)
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 117, in _apply
self.flatten_parameters()
File "/espnet/tools/venv/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 113, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
The image is the same one used for the Chainer run, so I suppose there should not be any bad cuDNN setup (this was also tested on another PC before updating the Docker containers on the hub). I will check the containers and upload new ones, but I would like to know whether this may be related to a specific device or not.
Top GitHub Comments
I just finished the research and found a possible solution for this error: for CUDA 10.0 and GTX-series cards, building warp-ctc with `-gencode arch=compute_70,code=sm_70` causes no problem, but for CUDA 10.1 and RTX cards (probably more because of the CUDA version than the device) this is no longer possible, so I followed the recommendation in the code and built with `-gencode arch=compute_60,code=sm_70`, and the Docker container then ran both the Chainer and PyTorch tests successfully. The forks of warp-ctc for Chainer and PyTorch can be found here: https://github.com/Fhrozen/warp-ctc and https://github.com/Fhrozen/chainer_ctc. I would recommend @jnishi modify this in his warp-ctc fork to support CUDA 10.1.
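For context, the RTX 2080 Ti reports compute capability 7.5 and the V100 reports 7.0, so a warp-ctc build whose `-gencode` targets do not cover the running device can fail only at kernel launch time with an execution error like the one above. A minimal standalone check of what the container actually sees (my own diagnostic sketch, not part of warp-ctc or ESPnet) could look like:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the compute capability of every visible GPU so it can be compared
// against the -gencode targets warp-ctc was built with (RTX 2080 Ti -> 7.5,
// V100 -> 7.0). Build with: nvcc check_arch.cu -o check_arch
int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        std::printf("no CUDA device visible\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```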
Another solution could be replacing `__shfl_down` and `__shfl_up` with the `__shfl_sync` family of intrinsics, but it seems to require a large modification in the .cu and .hpp files. Currently this is solved, but it still needs a fix in the ESPnet code 😉
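To illustrate what that second option involves (a generic warp-reduction sketch, not the actual warp-ctc kernel code): the legacy shuffle intrinsics take no mask, whereas the CUDA 9+ `*_sync` variants require an explicit mask of participating lanes, roughly:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Generic warp-reduction sketch (NOT taken from warp-ctc): the legacy
// __shfl_down(val, offset) has no mask argument, while the CUDA 9+
// replacement __shfl_down_sync(mask, val, offset) requires one.
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        // 0xffffffff: all 32 lanes of the warp participate in the shuffle.
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

__global__ void sum_one_warp(const float* in, float* out) {
    float v = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = v;  // lane 0 ends up with the warp-wide sum
}

int main() {
    float h_in[32], h_out = 0.f, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.f;  // expected sum: 32
    cudaMalloc((void**)&d_in, 32 * sizeof(float));
    cudaMalloc((void**)&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    sum_one_warp<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("warp sum = %f\n", h_out);
    return 0;
}
```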
RTX with CUDA 10.1 is now working, with the observation that the PyTorch backend is still not supported. I tested a Docker container with ESPnet 0.4 and PyTorch 1.0.1 and it had no problem, so I suppose this error will be resolved when v0.4.0 is merged.
For the V100 I cannot run the test for now, but the only problem was Chainer, so I suppose the containers running the newer version should also work on the V100. I will try to test once the server with a V100 is free.