
Inconsistent epoch count after training restart (#781)


Hello team,

Following issue #780, I tried to restart my training after changing the --max-tokens option from 1536 to 1500 (why such a small difference? The idea is to get a luckier batch composition; more details in issue #780).

The problem is that the log shows a very strange step/epoch counter. This is the log of the very last checkpoint with --max-tokens=1536:

| epoch 002:  14300 / 41687 loss=3.212, nll_loss=1.498, ppl=2.82, wps=33243, ups=0, wpb=81226.576, bsz=4527.015, num_updates=55988, lr=0.000133645, gnorm=0.214, clip=0.000, oom=0.000, wall=139674, train_wall=133506
| epoch 002 | valid on 'valid' subset | loss 3.143 | nll_loss 1.356 | ppl 2.56 | num_updates 56000 | best_loss 3.14275

So one epoch is 41687 batches, and the checkpoint was saved 14313 batches into the second epoch (56000 updates from the beginning).
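
A quick sanity check of those numbers (a sketch, assuming one optimizer update per batch, i.e. no gradient accumulation):

```python
# Values taken from the log lines above.
updates_per_epoch = 41687   # batches in one full epoch with --max-tokens=1536
total_updates = 56000       # num_updates stored in checkpoint_last.pt

completed_epochs = total_updates // updates_per_epoch                  # 1
offset_in_epoch = total_updates - completed_epochs * updates_per_epoch
print(completed_epochs + 1, offset_in_epoch)                           # epoch 2, 14313 batches in
```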

But when I restart the training with --max-tokens=1500, this happens:

| loaded checkpoint /home/ubuntu/training/checkpoint_last.pt (epoch 2 @ 56000 updates)
| epoch 002:    100 / 28937 loss=3.211, nll_loss=1.498, ppl=2.82, wps=8809, ups=0, wpb=81205.580, bsz=4526.309, num_updates=56101, lr=0.00013351, gnorm=0.214, clip=0.000, oom=0.000, wall=897, train_wall=133887

So now one epoch is only 28937 updates (?), and the counter starts from 0 instead of 14313. What is actually happening?
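
For context, fairseq groups sentences into batches by a token budget, so the number of batches per epoch does depend on --max-tokens; the toy sketch below (with a hypothetical batch_by_tokens helper, not fairseq's actual batching code) shows the mechanism. A change from 1536 to 1500 should only shift the count slightly, though, which is why the jump from 41687 to 28937 looks wrong.

```python
# Toy illustration of token-budget batching (not fairseq's implementation):
# sentences are packed into a batch until adding one more would exceed max_tokens.
def batch_by_tokens(lengths, max_tokens):
    batches, current, current_tokens = [], [], 0
    for i, n in enumerate(lengths):
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n
    if current:
        batches.append(current)
    return batches

lengths = [27, 31, 19, 42, 8, 55, 23, 16, 34, 29]   # made-up sentence lengths
print(len(batch_by_tokens(lengths, 120)))  # 3 batches: larger token budget, fewer batches
print(len(batch_by_tokens(lengths, 60)))   # 7 batches: smaller token budget, more batches
```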

Thanks for your help!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
davidecaroselli commented on Jun 7, 2019

@myleott that sounds exactly like the problem I’m facing; I would love to test it as soon as it’s merged into master.

Meanwhile, the training has completed the second epoch, and everything seems to be good now:

| epoch 002:  25200 / 25374 loss=3.190, nll_loss=1.475, ppl=2.78, wps=32866, ups=0, wpb=81208.899, bsz=4525.854, num_updates=83201, lr=0.000109632, gnorm=0.210, clip=0.000, oom=0.000, wall=62266, train_wall=196888
| epoch 002:  25300 / 25374 loss=3.190, nll_loss=1.475, ppl=2.78, wps=32868, ups=0, wpb=81208.972, bsz=4525.815, num_updates=83301, lr=0.000109566, gnorm=0.210, clip=0.000, oom=0.000, wall=62509, train_wall=197126
| epoch 002 | loss 3.190 | nll_loss 1.475 | ppl 2.78 | wps 32870 | ups 0 | wpb 81207.526 | bsz 4525.937 | num_updates 83374 | lr 0.000109518 | gnorm 0.210 | clip 0.000 | oom 0.000 | wall 62685 | train_wall 197299
| epoch 002 | valid on 'valid' subset | loss 3.106 | nll_loss 1.322 | ppl 2.50 | num_updates 83374 | best_loss 3.10556
| epoch 003:    100 / 41687 loss=3.156, nll_loss=1.440, ppl=2.71, wps=33084, ups=0, wpb=81065.931, bsz=4570.842, num_updates=83475, lr=0.000109452, gnorm=0.207, clip=0.000, oom=0.000, wall=62953, train_wall=197541
| epoch 003:    200 / 41687 loss=3.155, nll_loss=1.438, ppl=2.71, wps=33234, ups=0, wpb=81126.264, bsz=4531.234, num_updates=83575, lr=0.000109386, gnorm=0.207, clip=0.000, oom=0.000, wall=63196, train_wall=197779

So this seems to be just a transient problem: not really a bug, more of a visualization issue.
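
A minimal sketch of how such a display mismatch can arise (hypothetical names, not fairseq's actual trainer code): the global num_updates is restored from the checkpoint, but the per-epoch progress counter is rebuilt from a fresh iterator, so the step index restarts at 0 and the epoch length is recomputed from the current batching, while training itself continues from the right optimizer state.

```python
# Minimal sketch (hypothetical names, not fairseq's API): the global update
# counter comes from the checkpoint, but the per-epoch progress display is
# rebuilt from a fresh iterator, so its step index restarts at 0.
class ProgressState:
    def __init__(self, num_updates, batches_in_epoch):
        self.num_updates = num_updates            # restored from checkpoint_last.pt
        self.batches_in_epoch = batches_in_epoch  # recomputed from current --max-tokens
        self.step_in_epoch = 0                    # NOT restored -> display starts at 0

    def on_update(self):
        self.num_updates += 1
        self.step_in_epoch += 1

    def line(self, epoch):
        return (f"| epoch {epoch:03d}: {self.step_in_epoch:>6} / "
                f"{self.batches_in_epoch} num_updates={self.num_updates}")

state = ProgressState(num_updates=56000, batches_in_epoch=28937)
for _ in range(100):
    state.on_update()
print(state.line(2))   # step shows 100 even though epoch 2 is already thousands of updates in
```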

@lematt1991 the current command I’m using depends on libraries from our tool, ModernMT, so it would be difficult for you to reproduce directly. I’ll try to reproduce the problem with a plain fairseq run, but I think it should be quite easy to trigger: just restart an interrupted training.

Thanks guys for your help!

0 reactions
myleott commented on Jun 23, 2019

Fixed by #778
