expected behavior for `keep_last_ckpts = -1`

Hi @juliakreutzer,

I wanted to save ckpts at every validation step, regardless of the early-stopping metric score, so I set keep_last_ckpts = -1, following the description here:

https://github.com/joeynmt/joeynmt/blob/46b2fe3b05638728413ee5bae6a347411175c3c5/configs/small.yaml#L67

But joeynmt didn’t save any ckpts in that case. In fact, the TrainManager doesn’t call the _save_checkpoint() function if keep_last_ckpts is less than or equal to zero (a value for which the queue has infinite length: https://docs.python.org/3/library/queue.html).
https://github.com/joeynmt/joeynmt/blob/46b2fe3b05638728413ee5bae6a347411175c3c5/joeynmt/training.py#L98-L99
https://github.com/joeynmt/joeynmt/blob/46b2fe3b05638728413ee5bae6a347411175c3c5/joeynmt/training.py#L544-L547
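
To illustrate the mismatch described above, here is a minimal sketch using only the standard library. The `should_save` helper is hypothetical: it paraphrases the "less than or equal to zero" condition from this report and is not copied from joeynmt’s training.py.

```python
import queue

# Per the Python docs, maxsize <= 0 makes the queue unbounded.
unbounded = queue.Queue(maxsize=-1)
for step in range(1000, 6000, 1000):
    unbounded.put(f"{step}.ckpt")  # never blocks, nothing is evicted

# A bounded queue, by contrast, reports full once maxsize is reached.
bounded = queue.Queue(maxsize=2)
bounded.put("1000.ckpt")
bounded.put("2000.ckpt")


def should_save(keep_last_ckpts: int) -> bool:
    """Hypothetical guard mirroring the behavior reported in this issue."""
    return keep_last_ckpts > 0


print(unbounded.qsize(), unbounded.full())  # queue itself accepts everything
print(bounded.full())
print(should_save(-1))  # the guard skips saving before the queue is used
```

So the queue itself would happily hold every checkpoint; in this sketch it is the guard in front of it that prevents any save when the value is -1.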

What is the expected behavior? Did you indeed intend no save action when keep_last_ckpts = -1, meaning the description in the config is wrong? Or can we change the code so that ckpts are saved every time when keep_last_ckpts = -1?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
juliakreutzer commented, Apr 14, 2021

Yes, you’re right, it’s not well defined when both queues are overlapping. I like your simplification idea: save the most recent one plus any additional best checkpoints. I think this fits the practical use cases best without creating confusion 👍

0 reactions
may- commented, Mar 28, 2021

Yeah, that sounds good. We could have multiple queues, but then it gets a bit complicated to handle misspecifications such as keep_last_ckpts=-1 and keep_best_ckpts=10, no? Although we could return a configuration error and abort the process, that feels a bit too harsh to me, especially if I only see such a config error after waiting for a huge dataset to load…

If the latest ckpt is used almost exclusively to resume interrupted training, how about always saving the latest ckpt by default, without any option, and always using the “best” criterion to determine the number of ckpts to save/delete? Of course, we would change the “best” logic so that the truly best n models are saved. For instance, in the following case, best 3 means [4000, 6000, 2000] instead of the ckpts marked with *. Even though step 6000 doesn’t beat the best BLEU so far, it is still better than the worst one in the queue, so we update the queue. Then it’s more likely that relatively newer ckpts will be kept.

       steps 1000 bleu 10.0  *
       steps 2000 bleu 11.0  *
       steps 3000 bleu 10.5  
       steps 4000 bleu 11.5  *
       steps 5000 bleu 10.2
       steps 6000 bleu 11.2

Maybe it’s just my personal preference. Misspecification can happen however we define this option. Two separate queues for best and last also sound reasonable to me.
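
The “truly best n” rule from the example above can be sketched with a min-heap. This is only an illustration of the proposal, not joeynmt’s implementation; `keep_best_n` and `history` are hypothetical names, and the scores are the BLEU values from the listing.

```python
import heapq


def keep_best_n(history, n):
    """Return the steps of the n best checkpoints, best first.

    history: iterable of (step, bleu) pairs in validation order.
    """
    heap = []  # min-heap of (bleu, step); heap[0] is the worst kept ckpt
    for step, bleu in history:
        if len(heap) < n:
            heapq.heappush(heap, (bleu, step))
        elif bleu > heap[0][0]:
            # Beats the worst kept ckpt (not necessarily the best so far),
            # so it replaces that one.
            heapq.heapreplace(heap, (bleu, step))
    return [step for _, step in sorted(heap, reverse=True)]


history = [
    (1000, 10.0), (2000, 11.0), (3000, 10.5),
    (4000, 11.5), (5000, 10.2), (6000, 11.2),
]
best = keep_best_n(history, 3)
print(best)  # [4000, 6000, 2000]
```

Comparing against the worst kept score rather than the best one is what lets step 6000 enter the set, which tends to keep relatively newer ckpts.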
