Got a gradient error in the first epoch
Have you encountered this error before? @sw005320
./run.sh --stage 4 --queue g.q --ngpu 4 --etype vggblstm --elayers 3 --eunits 1024 --eprojs 1024 --batchsize 16 --train_set train_nodev_perturb --maxlen_in 2200
0 19700 14.31 15.7234 12.8966 0.871656 70311 1e-08
Exception in main training loop: invalid gradient at index 0 - expected shape [2] but got [4]
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 197, in <module>
main()
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 191, in main
train(args)
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 365, in train
trainer.run()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [2] but got [4]
# Accounting: time=71697 threads=1
# Finished at Thu Nov 8 13:12:38 CST 2018 with status 1
Exception in main training loop: invalid gradient at index 0 - expected shape [3] but got [4]
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 197, in <module>
main()
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 191, in main
train(args)
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 362, in train
trainer.run()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [3] but got [4]
# Accounting: time=15618 threads=1
# Finished at Sun Oct 28 02:59:52 CST 2018 with status 1
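My reading of the traceback (an assumption on my part, not a confirmed diagnosis): with --ngpu 4, update_core in asr_pytorch.py calls loss.backward(loss.new_ones(self.ngpu)), i.e. it seeds backward with a gradient of length 4, but torch.nn.DataParallel only returns one loss element per GPU that actually received data. When a minibatch holds fewer utterances than --ngpu (here 2 or 3), the gathered loss has shape [2] or [3], the fixed-size seed gradient no longer matches, and backward raises exactly the "expected shape [2] but got [4]" error above. A minimal sketch that reproduces the mismatch (the tensor names are mine, and the exact error wording varies by PyTorch version):

import torch

ngpu = 4  # value passed as --ngpu

# Hypothetical gathered loss: only 2 GPUs received data for this minibatch,
# so DataParallel returns a 2-element loss instead of a 4-element one.
per_gpu_loss = torch.randn(2, requires_grad=True)
loss = per_gpu_loss * 1.0  # non-leaf tensor, standing in for the DataParallel output

try:
    # Mirrors the call in asr_pytorch.py: seed gradient sized by ngpu, not by loss.
    loss.backward(loss.new_ones(ngpu))
except RuntimeError as e:
    print(e)  # shape-mismatch error, analogous to the one in the log above

# Sizing the seed gradient from the loss itself avoids the mismatch.
loss.backward(torch.ones_like(loss))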
Issue Analytics
- State:
- Created 5 years ago
- Comments: 15 (15 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
#gpus x 2
Thanks, @fanlu!
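For anyone hitting the same thing: a sketch of one possible guard (my own assumption, not necessarily the fix adopted in ESPnet) is to size the seed gradient from the loss tensor that DataParallel actually returned instead of from --ngpu, so a short final minibatch that only reached two or three GPUs no longer trips the shape check:

import torch

def backward_dataparallel_loss(loss):
    # Hypothetical helper: backprop a loss that may be a 0-dim scalar (single GPU)
    # or a per-GPU vector gathered by torch.nn.DataParallel, without assuming its length.
    if loss.dim() == 0:
        loss.backward()
    else:
        # Seed gradient sized from the loss itself, not from ngpu.
        loss.backward(torch.ones_like(loss))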