KeyError: 'validation/main/loss' in PyTorch when using multiple GPUs in training
Hi, I am using the aishell recipe and running asr_train.py. With a single GPU it works well. However, with 2 GPUs, training stops at the end of the first epoch and throws KeyError: 'validation/main/loss':
/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/matplotlib/tight_layout.py:231: UserWarning: tight_layout : falling back to Agg renderer
warnings.warn("tight_layout : falling back to Agg renderer")
Exception in main training loop: 'validation/main/loss'
Traceback (most recent call last):
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 345, in run
if entry.trigger(self):
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/triggers/minmax_value_trigger.py", line 52, in __call__
value = float(stats[key]) # copy to CPU
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/home/lcf/espnet/egs/Magicspeech/asr3/../../../espnet/bin/asr_train.py", line 368, in <module>
main(sys.argv[1:])
File "/home/lcf/espnet/egs/Magicspeech/asr3/../../../espnet/bin/asr_train.py", line 355, in main
train(args)
File "/home/lcf/espnet/espnet/asr/pytorch_backend/asr.py", line 631, in train
trainer.run()
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 376, in run
six.reraise(*exc_info)
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 345, in run
if entry.trigger(self):
File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/triggers/minmax_value_trigger.py", line 52, in __call__
value = float(stats[key]) # copy to CPU
KeyError: 'validation/main/loss'
# Accounting: time=129 threads=1
# Ended (code 1) at Mon Mar 23 12:46:57 CST 2020, elapsed time 129 seconds
How to fix this?
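The UserWarning near the top of the log is worth noting. Below is a minimal sketch (not ESPnet's actual training code, and assuming a machine with at least two visible CUDA devices): under nn.DataParallel each replica returns a 0-dim loss, the gather step turns those scalars into a length-num_gpus vector, and a reporter or trigger keyed on a single 'validation/main/loss' value can then fail to find the entry it expects. Reducing with .mean() restores a scalar.

```python
# Minimal sketch, not ESPnet code: shows how nn.DataParallel gathers
# per-replica scalar losses into a vector (the UserWarning above), and how
# reducing with .mean() gives back the single scalar a reporter expects.
# Assumes >= 2 visible CUDA devices; ToyModel is a made-up stand-in.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, x):
        # each GPU replica returns a 0-dim (scalar) "loss"
        return self.fc(x).sum()

model = nn.DataParallel(ToyModel()).cuda()
x = torch.randn(8, 4).cuda()

loss = model(x)      # gathered along dim 0 -> shape (num_gpus,), not a scalar
loss = loss.mean()   # reduce back to a scalar before reporting / backward()
loss.backward()
```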
Top GitHub Comments
Cool! This is a very good note. We may need to stick to chainer 6.0.0. I want to leave it as it is (because our default chainer version in the Makefile is 6.0.0), but if many people get stuck on this because of it, we'll either ask people to pin chainer to 6.0.0 or fix this bug.

Thank you very much! After I installed chainer==6.0.0, the problem was solved. Thanks!