ASR training hangs in epoch 0, after a few iterations

Hi,

I am using ESPnet commit 18ed8b0d76ae4bb32ce901152fdb35d1fc7484e4 (Tue Aug 28 10:56:46 2018 -0400) with PyTorch 0.4.1.

Trying out librispeech. The training just stops (hangs) in epoch 0 after a few iterations.

I am using the PyTorch backend with ngpus=4. There is no error in the log.

tail -f train.log
0 300 288.4 324.985 251.815 0.343726 456.825 1e-08
total [#.................................................] 3.62%
this epoch [###########################.......................] 54.35%
300 iter, 0 epoch / 15 epochs
0.69902 iters/sec. Estimated time to finish: 3:10:15.971187.

Output of nvidia-smi: GPU utilization remains at zero after a few iterations.

I am using cuda-8.0.61 and cudnn-6.

Any comments on this?

Top GitHub Comments

bobchennan commented, Aug 22, 2019

Same problem still. I am considering rewriting the IO part with the PyTorch DataLoader.

chainer/iterators/multiprocess_iterator.py:28: TimeoutWarning: Stalled dataset is detected. 
See the documentation of MultiprocessIterator for common causes and workarounds:

https://docs.chainer.org/en/stable/reference/generated/chainer.iterators.MultiprocessIterator.html

kan-bayashi commented, Aug 24, 2019

I ran into the same problem. MultiprocessIterator is buggy; I agree with @bobchennan.
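
The hang reported above left no error in the log, which makes it hard to tell a stalled data iterator from a slow one. As a rough illustration only, here is a minimal standard-library sketch of a watchdog that turns a silent stall into an exception. It is not how Chainer's MultiprocessIterator or ESPnet detect stalls, and the names `iterate_with_timeout` and `StalledIteratorError` are hypothetical. It assumes the stall is on the data-loading side; a hang inside the GPU computation itself would not be caught by it.

```python
import queue
import threading


class StalledIteratorError(RuntimeError):
    """Raised when the wrapped iterator yields nothing within the timeout."""


def iterate_with_timeout(source, timeout_s, prefetch=2):
    """Yield items from `source`; raise if the next item takes longer than `timeout_s`.

    `source` is any iterable of batches (hypothetical stand-in for a data iterator).
    """
    buf = queue.Queue(maxsize=prefetch)

    def worker():
        # Pull items in a background thread so the consumer can time out.
        try:
            for item in source:
                buf.put(("item", item))
            buf.put(("end", None))
        except Exception as exc:  # forward loader errors to the consumer
            buf.put(("error", exc))

    # Daemon thread: a stalled loader will not block interpreter exit.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        try:
            kind, payload = buf.get(timeout=timeout_s)
        except queue.Empty:
            raise StalledIteratorError(
                f"no batch arrived within {timeout_s} s"
            ) from None
        if kind == "end":
            return
        if kind == "error":
            raise payload
        yield payload
```

Wrapping the training iterator this way, for example `for batch in iterate_with_timeout(batches, timeout_s=60): ...`, would make a stall show up in train.log as a traceback instead of an idle process. The timeout value is a judgment call: it has to be longer than the slowest legitimate batch.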