Got a gradient error in the first epoch
Have you encountered this error before? @sw005320
./run.sh --stage 4 --queue g.q --ngpu 4 --etype vggblstm --elayers 3 --eunits 1024 --eprojs 1024 --batchsize 16 --train_set train_nodev_perturb --maxlen_in 2200
0 19700 14.31 15.7234 12.8966 0.871656 70311 1e-08
Exception in main training loop: invalid gradient at index 0 - expected shape [2] but got [4]
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 197, in <module>
main()
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 191, in main
train(args)
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 365, in train
trainer.run()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [2] but got [4]
# Accounting: time=71697 threads=1
# Finished at Thu Nov 8 13:12:38 CST 2018 with status 1
Exception in main training loop: invalid gradient at index 0 - expected shape [3] but got [4]
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 197, in <module>
main()
File "/mnt/cephfs2/asr/users/fanlu/espnet/egs/kefu/asr1/../../../src/bin/asr_train.py", line 191, in main
train(args)
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 362, in train
trainer.run()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
six.reraise(*sys.exc_info())
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
update()
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
self.update_core()
File "/mnt/cephfs2/asr/users/fanlu/espnet/src/asr/asr_pytorch.py", line 123, in update_core
loss.backward(loss.new_ones(self.ngpu)) # Backprop
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/cephfs2/asr/users/fanlu/miniconda3/envs/py2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [3] but got [4]
# Accounting: time=15618 threads=1
# Finished at Sun Oct 28 02:59:52 CST 2018 with status 1
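My reading of the traceback (an assumption on my part, not a confirmed diagnosis): with --ngpu 4, update_core in asr_pytorch.py calls loss.backward(loss.new_ones(self.ngpu)), i.e. it seeds backward with a gradient of length 4, but torch.nn.DataParallel only returns one loss element per GPU that actually received data. When a minibatch holds fewer utterances than --ngpu (here 2 or 3), the gathered loss has shape [2] or [3], the fixed-size seed gradient no longer matches, and backward raises exactly the "expected shape [2] but got [4]" error above. A minimal sketch that reproduces the mismatch (the tensor names are mine, and the exact error wording varies by PyTorch version):

import torch

ngpu = 4  # value passed as --ngpu

# Hypothetical gathered loss: only 2 GPUs received data for this minibatch,
# so DataParallel returns a 2-element loss instead of a 4-element one.
per_gpu_loss = torch.randn(2, requires_grad=True)
loss = per_gpu_loss * 1.0  # non-leaf tensor, standing in for the DataParallel output

try:
    # Mirrors the call in asr_pytorch.py: seed gradient sized by ngpu, not by loss.
    loss.backward(loss.new_ones(ngpu))
except RuntimeError as e:
    print(e)  # shape-mismatch error, analogous to the one in the log above

# Sizing the seed gradient from the loss itself avoids the mismatch.
loss.backward(torch.ones_like(loss))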
Issue Analytics
- State:
- Created 5 years ago
- Comments: 15 (15 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
#gpus x 2
Thanks, @fanlu!
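For anyone hitting the same thing: a sketch of one possible guard (my own assumption, not necessarily the fix adopted in ESPnet) is to size the seed gradient from the loss tensor that DataParallel actually returned instead of from --ngpu, so a short final minibatch that only reached two or three GPUs no longer trips the shape check:

import torch

def backward_dataparallel_loss(loss):
    # Hypothetical helper: backprop a loss that may be a 0-dim scalar (single GPU)
    # or a per-GPU vector gathered by torch.nn.DataParallel, without assuming its length.
    if loss.dim() == 0:
        loss.backward()
    else:
        # Seed gradient sized from the loss itself, not from ngpu.
        loss.backward(torch.ones_like(loss))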