ASR training hangs in epoch 0 after a few iterations
Hi,
I am using ESPnet commit 18ed8b0d76ae4bb32ce901152fdb35d1fc7484e4 (Tue Aug 28 10:56:46 2018 -0400) with PyTorch 0.4.1.
I am trying out the librispeech recipe. The training just stops (hangs) in epoch 0 after a few iterations.
I am using the pytorch backend with ngpus=4. There is no error in the log.
```
tail -f train.log
0  300  288.4  324.985  251.815  0.343726  456.825  1e-08
     total [#.................................................]  3.62%
this epoch [###########################.......................] 54.35%
       300 iter, 0 epoch / 15 epochs
   0.69902 iters/sec. Estimated time to finish: 3:10:15.971187.
```
According to the output of nvidia-smi, GPU utilization stays at zero after a few iterations.
I am using cuda-8.0.61 and cudnn-6.
Any comments on this?
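One generic way to see where the process is stuck (a debugging sketch, not something from the original report) is to register Python's `faulthandler` at the top of the training entry point and then signal the hung process to dump the stack of every thread:

```python
# Generic hang-debugging sketch (not ESPnet code): register faulthandler so
# that sending SIGUSR1 to the stuck training process prints the Python
# traceback of every thread to stderr without killing the process.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```

Then run `kill -USR1 <training PID>` from another shell; if the dump shows the main process blocked on the iterator's queue, the hang is in the data-loading workers rather than in the GPU code.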
Top GitHub Comments
Same problem still. Considering rewriting the IO part with the PyTorch DataLoader.
https://docs.chainer.org/en/stable/reference/generated/chainer.iterators.MultiprocessIterator.html
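A rough sketch of what such a rewrite could look like, using `torch.utils.data.DataLoader` in place of `chainer.iterators.MultiprocessIterator`; `UtteranceDataset` and `collate_batch` below are hypothetical placeholders for the feature loading ESPnet's converter currently performs, not existing ESPnet classes:

```python
# Sketch only: minibatching via torch.utils.data.DataLoader instead of
# chainer.iterators.MultiprocessIterator.
import torch
from torch.utils.data import Dataset, DataLoader


class UtteranceDataset(Dataset):
    """Hypothetical dataset that loads one utterance's features on demand."""

    def __init__(self, utt_ids):
        self.utt_ids = utt_ids

    def __len__(self):
        return len(self.utt_ids)

    def __getitem__(self, idx):
        # Placeholder: real code would load features/targets for
        # self.utt_ids[idx] from disk here.
        feats = torch.zeros(100, 83)                 # dummy (frames, feat dim)
        target = torch.zeros(10, dtype=torch.long)   # dummy token ids
        return feats, target


def collate_batch(batch):
    # Placeholder collate function; real code would pad to the longest
    # utterance in the batch and also return sequence lengths.
    feats, targets = zip(*batch)
    return list(feats), list(targets)


loader = DataLoader(
    UtteranceDataset(["utt1", "utt2", "utt3"]),
    batch_size=2,
    shuffle=True,
    num_workers=2,          # worker processes replace MultiprocessIterator's pool
    collate_fn=collate_batch,
)

for feats, targets in loader:
    pass  # forward/backward would go here
```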
I met the same problem. `MultiprocessIterator` is buggy; I agree with @bobchennan.
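If the multiprocessing pool is indeed the culprit, one possible workaround (an assumption, not something confirmed in this thread) is to fall back to Chainer's single-process iterator and accept slower IO; `train_batches` and `batch_size` below stand in for the values ESPnet builds in its training script:

```python
# Hypothetical workaround: replace MultiprocessIterator with SerialIterator so
# that no worker processes are spawned and minibatches are produced in-process.
import chainer

train_batches = list(range(100))  # placeholder dataset
batch_size = 32

train_iter = chainer.iterators.SerialIterator(
    train_batches, batch_size, repeat=True, shuffle=True
)
first_batch = train_iter.next()
```

`chainer.iterators.MultithreadIterator` is another option that keeps some parallelism without forking worker processes, at the cost of the GIL.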