
KeyError: 'validation/main/loss' in PyTorch when using multiple GPUs in training

See original GitHub issue

Hi, I am using the aishell recipe and running asr_train.py. With a single GPU it works well. However, with 2 GPUs, training finishes at the end of the first epoch and throws the error KeyError: 'validation/main/loss'.

/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/matplotlib/tight_layout.py:231: UserWarning: tight_layout : falling back to Agg renderer
  warnings.warn("tight_layout : falling back to Agg renderer")
Exception in main training loop: 'validation/main/loss'
Traceback (most recent call last):
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 345, in run
    if entry.trigger(self):
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/triggers/minmax_value_trigger.py", line 52, in __call__
    value = float(stats[key])  # copy to CPU
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/home/lcf/espnet/egs/Magicspeech/asr3/../../../espnet/bin/asr_train.py", line 368, in <module>
    main(sys.argv[1:])
  File "/home/lcf/espnet/egs/Magicspeech/asr3/../../../espnet/bin/asr_train.py", line 355, in main
    train(args)
  File "/home/lcf/espnet/espnet/asr/pytorch_backend/asr.py", line 631, in train
    trainer.run()
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 376, in run
    six.reraise(*exc_info)
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 345, in run
    if entry.trigger(self):
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/triggers/minmax_value_trigger.py", line 52, in __call__
    value = float(stats[key])  # copy to CPU
KeyError: 'validation/main/loss'
# Accounting: time=129 threads=1
# Ended (code 1) at Mon Mar 23 12:46:57 CST 2020, elapsed time 129 seconds

How to fix this?
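For context, the failure point in the traceback is chainer/training/triggers/minmax_value_trigger.py, which looks up the summarized validation loss by key. A minimal sketch of that lookup (with a hypothetical stats dict, not taken from the issue) shows why a key that was never reported raises KeyError:

    # Sketch of the lookup that fails in minmax_value_trigger.py.
    # Here the evaluator only reported the training loss, so the
    # validation key is missing from the summarized observations.
    stats = {'main/loss': 5.2}            # hypothetical reported values
    key = 'validation/main/loss'
    value = float(stats[key])             # raises KeyError: 'validation/main/loss'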

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:9 (5 by maintainers)

Top GitHub Comments

1 reaction
sw005320 commented, Mar 25, 2020

Cool! This is a very good note. We may need to stick with chainer 6.0.0. I want to leave it as it is (because our default chainer version in the Makefile is 6.0.0), but if many people get stuck because of this, we'll either ask people to pin chainer to 6.0.0 or fix this bug.
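For anyone hitting the same problem, one way to follow this suggestion is to pin the package version explicitly (the exact command depends on your environment; this assumes pip in the training conda env):

    pip install chainer==6.0.0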

0 reactions
LCF2764 commented, Mar 25, 2020

Thank you very much! After I installed chainer==6.0.0, the problem was solved. Thanks!
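A quick way to confirm the downgrade took effect is to check the version from the same environment used for training (a sketch, not from the issue):

    import chainer
    print(chainer.__version__)   # should print 6.0.0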

Read more comments on GitHub >

Top Results From Across the Web

Error When Using Multiple GPus
My code works fine when using just 1 GPU using torch.cuda.set_device(0) but it takes a lot of time to train in single GPU....
Read more >
Distributed GPU training guide (SDK v2) - Azure
Learn more about how to use distributed GPU training code in Azure Machine Learning (ML). This article will not teach you about distributed...
Read more >
Efficient Training on Multiple GPUs
This is a built-in feature of Pytorch. Note that in general it is advised to use DDP as it is better maintained and...
Read more >
Training Deep Neural Network using Data Parallel?
Hi, I need to train some simpler networks on large datasets and would like to use multiple GPUs for training. Is there a...
Read more >
How to use multiple GPUs in pytorch? - python
PyTorch Lightning Multi-GPU training. This is possibly the best option IMHO to train on CPU/GPU/TPU without changing your original PyTorch ...
Read more >
