Inconsistent epoch count after training restart with changed --max-tokens
Hello team,
following issue #780, I tried to restart my training, changing the --max-tokens option from 1536 to 1500 (why such a small difference? The idea is to get a luckier batch composition; more details in issue #780).
The problem is that the log shows a very strange step/epoch counter. This is the log of the very last checkpoint with --max-tokens=1536:
| epoch 002: 14300 / 41687 loss=3.212, nll_loss=1.498, ppl=2.82, wps=33243, ups=0, wpb=81226.576, bsz=4527.015, num_updates=55988, lr=0.000133645, gnorm=0.214, clip=0.000, oom=0.000, wall=139674, train_wall=133506
| epoch 002 | valid on 'valid' subset | loss 3.143 | nll_loss 1.356 | ppl 2.56 | num_updates 56000 | best_loss 3.14275
So one epoch is 41687 batches, and the checkpoint was saved 14313 batches after the start of the second epoch (56000 updates from the beginning of training).
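Just to spell out the arithmetic behind those numbers (taken from the log above):

```python
# Numbers from the log above.
updates_per_epoch = 41687   # batches in one epoch with --max-tokens=1536
total_updates = 56000       # num_updates at the last checkpoint

# How far into the second epoch the checkpoint was saved.
offset_in_epoch = total_updates - updates_per_epoch
print(offset_in_epoch)      # 14313
```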
But when I restart the training with --max-tokens=1500, this happens:
| loaded checkpoint /home/ubuntu/training/checkpoint_last.pt (epoch 2 @ 56000 updates)
| epoch 002: 100 / 28937 loss=3.211, nll_loss=1.498, ppl=2.82, wps=8809, ups=0, wpb=81205.580, bsz=4526.309, num_updates=56101, lr=0.00013351, gnorm=0.214, clip=0.000, oom=0.000, wall=897, train_wall=133887
So now one epoch is only 28937 updates (?), and the starting step is 0 instead of 14313. What is actually happening?
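For reference, this is roughly how I picture the token-budget batching, just to show why the number of batches per epoch depends on --max-tokens at all. This is a simplified sketch with made-up sentence lengths, not fairseq's actual implementation, and it doesn't explain a jump from 41687 down to 28937, which still looks wrong to me:

```python
# Simplified sketch of token-budget batching (NOT fairseq's actual code).
# Sentences are added to the current batch until one more would exceed
# the token budget, so the number of batches per epoch depends on the
# budget and on how the sentence lengths happen to line up.

def count_batches(sentence_lengths, max_tokens):
    batches = 0
    current_tokens = 0
    for length in sentence_lengths:
        if current_tokens > 0 and current_tokens + length > max_tokens:
            batches += 1              # close the current batch
            current_tokens = 0
        current_tokens += length
    if current_tokens > 0:
        batches += 1                  # last, possibly partial, batch
    return batches

# Made-up corpus, only to show that a small change in the budget can
# shift the batch boundaries and change the total batch count.
lengths = [800, 720, 800, 700, 760, 760, 500, 1000]
print(count_batches(lengths, 1536))   # 4
print(count_batches(lengths, 1500))   # 6
```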
Thanks for your help!
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@myleott that sounds exactly like the problem I'm facing; I would love to test it as soon as it's merged into master.
Meanwhile, the training has completed the second epoch and everything seems to be fine now. So this looks like just a transient problem, not really a bug but rather a visualization issue.
@lematt1991 the current command I'm using depends on libraries from our tool, ModernMT, so it would be difficult for you to reproduce. I'll try to reproduce the problem with a plain fairseq run, but I think it should be quite easy to reproduce: just restart an interrupted training.
Thanks guys for your help!
Fixed by #778