
got a gradient error in first epoch

See original GitHub issue

Have you encountered this error before? @sw005320

./run.sh --stage 4 --queue g.q --ngpu 4 --etype vggblstm --elayers 3 --eunits 1024 --eprojs 1024 --batchsize 16 --train_set train_nodev_perturb --maxlen_in 2200
0           19700       14.31       15.7234        12.8966                                                                                  0.871656                         70311         1e-08
Exception in main training loop: invalid gradient at index 0 - expected shape [2] but got [4]
Traceback (most recent call last):
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run                                                               
    update()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update                                          
    self.update_core()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
    loss.backward(loss.new_ones(self.ngpu))  # Backprop
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward                                                                       
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward                                                            
    allow_unreachable=True)  # allow_unreachable flag
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 197, in <module>                                                                                   
    main()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 191, in main                                                                                       
    train(args)
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 365, in train
    trainer.run()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run                                                               
    six.reraise(*sys.exc_info())
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run                                                               
    update()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update                                          
    self.update_core()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
    loss.backward(loss.new_ones(self.ngpu))  # Backprop
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward                                                                       
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward                                                            
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [2] but got [4]
# Accounting: time=71697 threads=1
# Finished at Thu Nov 8 13:12:38 CST 2018 with status 1
Exception in main training loop: invalid gradient at index 0 - expected shape [3] but got [4]
Traceback (most recent call last):
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
    loss.backward(loss.new_ones(self.ngpu))  # Backprop
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 197, in <module>
    main()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 191, in main
    train(args)
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 362, in train
    trainer.run()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
  File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
    loss.backward(loss.new_ones(self.ngpu))  # Backprop
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [3] but got [4]
# Accounting: time=15618 threads=1
# Finished at Sun Oct 28 02:59:52 CST 2018 with status 1
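For context, the traceback shows where the shape check fails: update_core calls loss.backward(loss.new_ones(self.ngpu)) with ngpu = 4, while autograd requires the explicit gradient to have the same shape as the loss tensor, which here has only 2 (respectively 3) elements, i.e. fewer devices returned a loss than the configured GPU count. The following is a minimal plain-PyTorch sketch, not the ESPnet code, that reproduces the same class of mismatch:

import torch

# Minimal sketch (plain PyTorch, not the ESPnet updater) of the shape check
# behind this RuntimeError: backward() is handed an explicit gradient whose
# shape does not match the non-scalar loss tensor it belongs to.
ngpu = 4                                   # configured number of GPUs
x = torch.randn(2, 5, requires_grad=True)  # a small batch: only 2 samples
loss = x.sum(dim=1)                        # per-device style loss, shape [2]

try:
    # Mirrors loss.backward(loss.new_ones(self.ngpu)) when fewer than ngpu
    # devices actually received data: gradient shape [4] vs. loss shape [2].
    loss.backward(loss.new_ones(ngpu))
except RuntimeError as err:
    # PyTorch 0.4.x reports it as:
    # invalid gradient at index 0 - expected shape [2] but got [4]
    print(err)

# Sizing the gradient from the loss itself keeps the two shapes consistent.
loss = x.sum(dim=1)
loss.backward(loss.new_ones(loss.size(0)))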

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

2 reactions
sw005320 commented, Nov 10, 2018

#gpus x 2
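The reply is terse; presumably it means keeping the effective batch size at least twice the number of GPUs, so that every device always receives data. A complementary, defensive way to avoid the crash, shown here only as an illustration (backprop_step is a hypothetical helper, not necessarily the change ESPnet made), is to size the backward gradient from the loss tensor that actually came back:

import torch

# Hypothetical helper (an illustration, not necessarily ESPnet's fix): take
# the gradient's size from the loss that came back, so the shapes still match
# when a small batch was scattered over fewer than ngpu devices.
def backprop_step(loss):
    if loss.dim() == 0:
        loss.backward()                              # single-device scalar loss
    else:
        loss.backward(loss.new_ones(loss.size(0)))   # one loss element per device used

# Usage with a toy per-device loss vector of length 2:
x = torch.randn(2, 5, requires_grad=True)
backprop_step(x.sum(dim=1))

Either way, keeping every batch at least as large as the number of GPUs avoids splitting a tiny batch across only some of the devices in the first place.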

0 reactions
kan-bayashi commented, Nov 11, 2018

Thanks, @fanlu!

Read more comments on GitHub >

Top Results From Across the Web

  • Gradient disappearing after first epoch in manual linear ...
    The issue seems to be that after the first epoch the gradient attribute is set to None, but I'm a little confused why...
  • review: gradient descent, epochs, validation in neural network ...
    When all batches are traversed, 1 epoch is done. However, gradient descent hasn't minimized the loss function yet, the loss function still being ...
  • Torch.sigmoid function gradient issue after first epoch (Trying ...
    Hi, I'm running the following code for an optimization problem. (The loss function here is just a simplified example).
  • Error with MQCNNEstimator in benchmark_m4 examples #1405
    On epoch end, the program crashes with the error: gluonts.core.exception.GluonTSUserError: Got NaN in first epoch.
  • Difference Between a Batch and an Epoch in a Neural Network
    The optimization algorithm is called “gradient descent“, where “gradient” refers to the calculation of an error gradient or slope of error ...
