KeyError: 'validation/main/loss' in PyTorch when using multiple GPUs in training
Hi, I am using the aishell recipe and running asr_train.py. With a single GPU it works well. However, with 2 GPUs, training stops at the end of the first epoch and throws KeyError: 'validation/main/loss':
/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/matplotlib/tight_layout.py:231: UserWarning: tight_layout : falling back to Agg renderer
warnings.warn("tight_layout : falling back to Agg renderer")
Exception in main training loop: 'validation/main/loss'
Traceback (most recent call last):
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 345, in run
if entry.trigger(self):
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/triggers/minmax_value_trigger.py", line 52, in __call__
value = float(stats[key]) # copy to CPU
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/home/lcf/espnet/egs/Magicspeech/asr3/../../../espnet/bin/asr_train.py", line 368, in <module>
main(sys.argv[1:])
File "/home/lcf/espnet/egs/Magicspeech/asr3/../../../espnet/bin/asr_train.py", line 355, in main
train(args)
File "/home/lcf/espnet/espnet/asr/pytorch_backend/asr.py", line 631, in train
trainer.run()
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 376, in run
six.reraise(*exc_info)
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 345, in run
if entry.trigger(self):
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/triggers/minmax_value_trigger.py", line 52, in __call__
value = float(stats[key]) # copy to CPU
KeyError: 'validation/main/loss'
# Accounting: time=129 threads=1
# Ended (code 1) at Mon Mar 23 12:46:57 CST 2020, elapsed time 129 seconds
How to fix this?
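The UserWarning near the top of the log is worth noting. Below is a minimal sketch (not ESPnet's actual training code, and assuming a machine with at least two visible CUDA devices): under nn.DataParallel each replica returns a 0-dim loss, the gather step turns those scalars into a length-num_gpus vector, and a reporter or trigger keyed on a single 'validation/main/loss' value can then fail to find the entry it expects. Reducing with .mean() restores a scalar.

```python
# Minimal sketch, not ESPnet code: shows how nn.DataParallel gathers
# per-replica scalar losses into a vector (the UserWarning above), and how
# reducing with .mean() gives back the single scalar a reporter expects.
# Assumes >= 2 visible CUDA devices; ToyModel is a made-up stand-in.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, x):
        # each GPU replica returns a 0-dim (scalar) "loss"
        return self.fc(x).sum()

model = nn.DataParallel(ToyModel()).cuda()
x = torch.randn(8, 4).cuda()

loss = model(x)      # gathered along dim 0 -> shape (num_gpus,), not a scalar
loss = loss.mean()   # reduce back to a scalar before reporting / backward()
loss.backward()
```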
Top GitHub Comments
Cool! This is a very good note. We may need to stick to chainer 6.0.0. I want to leave it as it is (because our default chainer version in the Makefile is 6.0.0), but if many people get stuck on this because of it, we'll either ask people to pin chainer to 6.0.0 or fix this bug.

Thank you very much! After I installed chainer==6.0.0, the problem was solved. Thanks!