ASR training hangs in epoch 0, after few iterations

Hi,

I am using ESPnet commit 18ed8b0d76ae4bb32ce901152fdb35d1fc7484e4 (Tue Aug 28 10:56:46 2018 -0400) with PyTorch 0.4.1.

I am trying out LibriSpeech. Training just stops (hangs) in epoch 0 after a few iterations.

I am using the PyTorch backend with ngpus=4. There is no error in the log.

tail -f train.log
0  300  288.4  324.985  251.815  0.343726  456.825  1e-08
total [#.................................................] 3.62%
this epoch [###########################.......................] 54.35%
300 iter, 0 epoch / 15 epochs
0.69902 iters/sec. Estimated time to finish: 3:10:15.971187.

Output of nvidia-smi: GPU utilization remains at zero after a few iterations.

[Screenshot of nvidia-smi output, 2018-09-01]

I am using CUDA 8.0.61 and cuDNN 6.

Any comments on this?
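
When a training run hangs without any error in the log, one general way to see where each thread is stuck is Python's standard `faulthandler` module. The sketch below is not part of ESPnet; `run_with_hang_dump`, `step_fn`, and the timeout value are hypothetical names chosen for illustration. It arms a watchdog before each training step and dumps the stack of every thread if a step takes longer than the timeout.

```python
import faulthandler
import sys


def run_with_hang_dump(step_fn, n_steps, timeout_s=600.0, log_file=sys.stderr):
    """Call step_fn(i) for i in range(n_steps).

    If a single step runs longer than timeout_s seconds, faulthandler
    writes the traceback of every thread to log_file. The step itself is
    not interrupted; the dump only shows where it is blocked.
    log_file must be a real file with a file descriptor.
    """
    for i in range(n_steps):
        # Arm the watchdog for this step.
        faulthandler.dump_traceback_later(timeout_s, file=log_file)
        try:
            step_fn(i)
        finally:
            # Disarm it as soon as the step returns or raises.
            faulthandler.cancel_dump_traceback_later()
```

If the dump shows the main thread waiting on the data iterator rather than inside a CUDA call, that points to the input pipeline instead of the GPU.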

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 15 (11 by maintainers)

Top GitHub Comments

bobchennan commented, Aug 22, 2019 (1 reaction)

Still the same problem. I am considering rewriting the I/O part with the PyTorch DataLoader.

chainer/iterators/multiprocess_iterator.py:28: TimeoutWarning: Stalled dataset is detected. 
See the documentation of MultiprocessIterator for common causes and workarounds:

https://docs.chainer.org/en/stable/reference/generated/chainer.iterators.MultiprocessIterator.html
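
The warning above is about a stalled data source. As a minimal sketch of that idea using only the standard library (this is not Chainer's or ESPnet's implementation, and `iterate_with_timeout` and `StalledIteratorError` are hypothetical names), a batch iterator can be wrapped so that a stall raises an error instead of hanging silently:

```python
import queue
import threading


class StalledIteratorError(RuntimeError):
    """Raised when no item arrives within the timeout."""


_DONE = object()  # sentinel marking the end of the wrapped iterable


def iterate_with_timeout(iterable, timeout_s=60.0, prefetch=2):
    """Yield items from iterable, raising if the next item takes too long.

    Items are produced in a daemon thread and handed over through a
    bounded queue; the consumer waits at most timeout_s for each item.
    """
    q = queue.Queue(maxsize=prefetch)

    def producer():
        try:
            for item in iterable:
                q.put((True, item))
            q.put((True, _DONE))
        except Exception as exc:
            # Forward producer-side errors to the consumer.
            q.put((False, exc))

    threading.Thread(target=producer, daemon=True).start()

    while True:
        try:
            ok, payload = q.get(timeout=timeout_s)
        except queue.Empty:
            raise StalledIteratorError(
                "no item received within %.1f s" % timeout_s
            ) from None
        if not ok:
            raise payload
        if payload is _DONE:
            return
        yield payload
```

Wrapping the training iterator this way turns a silent hang into an exception with a traceback, which makes it easier to tell whether data loading is the part that stopped.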

kan-bayashi commented, Aug 24, 2019 (0 reactions)

I ran into the same problem. MultiprocessIterator is buggy; I agree with @bobchennan.
