expected behavior for `keep_last_ckpts = -1`

Hi @juliakreutzer,

I wanted to save ckpts at every validation step regardless of the early-stopping-metric score, so I set keep_last_ckpts = -1, following the description here:

https://github.com/joeynmt/joeynmt/blob/46b2fe3b05638728413ee5bae6a347411175c3c5/configs/small.yaml#L67

But joeynmt didn’t save any ckpts at all in that case. The TrainManager doesn’t call _save_checkpoint() if keep_last_ckpts is less than or equal to zero, even though a Python queue created with maxsize <= 0 has infinite length (https://docs.python.org/3/library/queue.html). https://github.com/joeynmt/joeynmt/blob/46b2fe3b05638728413ee5bae6a347411175c3c5/joeynmt/training.py#L98-L99 https://github.com/joeynmt/joeynmt/blob/46b2fe3b05638728413ee5bae6a347411175c3c5/joeynmt/training.py#L544-L547
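To illustrate the mismatch, here is a minimal sketch using only the standard library. The helper names `make_ckpt_queue` and `should_save` are hypothetical and are not taken from the joeynmt code; `should_save` only mimics the guard as described in this issue (no save when the value is <= 0).

```python
import queue


def make_ckpt_queue(keep_last_ckpts: int) -> queue.Queue:
    # queue.Queue treats maxsize <= 0 as "no size limit".
    return queue.Queue(maxsize=keep_last_ckpts)


def should_save(ckpt_queue: queue.Queue) -> bool:
    # Hypothetical guard mimicking the behavior reported above:
    # saving is skipped whenever maxsize is not positive.
    return ckpt_queue.maxsize > 0


unbounded = make_ckpt_queue(-1)
for step in range(1000, 6000, 1000):
    # An unbounded queue accepts every item without ever being full.
    unbounded.put(f"{step}.ckpt")

bounded = make_ckpt_queue(3)
```

So the queue itself could hold every checkpoint when the setting is -1, but a guard of this shape means nothing is ever put into it.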

What is the expected behavior? Did you indeed intend no save action when keep_last_ckpts = -1, meaning the description in the config is wrong? Or can we change the code so that ckpts are saved every time when keep_last_ckpts = -1?

Top GitHub Comments

juliakreutzer commented, Apr 14, 2021

Yes, you’re right, it’s not well defined when both queues overlap. I like your simplification idea: save the most recent one plus any additional best checkpoints. I think this fits the practical use cases in the best way without creating confusion👍

may- commented, Mar 28, 2021

Yeah, that sounds good. We could have multiple queues, but then it gets a bit complicated to handle misspecifications such as keep_last_ckpts=-1 and keep_best_ckpts=10, no? We could raise a configuration error and abort the process, but that feels a bit too harsh to me, especially if I only see such a config error after waiting for a huge dataset to load…

If the latest ckpt is used almost exclusively to resume interrupted training, how about always saving the latest ckpt by default, without any option, and always using the “best” criterion to determine the number of ckpts to save/delete? Of course, we would change the “best” logic so that really the best n models are saved. For instance, in the following case, best 3 means [4000, 6000, 2000] instead of the ckpts marked with *. Step 6000 doesn’t beat the best BLEU so far, but it is still better than the worst one in the queue, so we update the queue. That makes it more likely that relatively newer ckpts will be kept.

       steps 1000 bleu 10.0  *
       steps 2000 bleu 11.0  *
       steps 3000 bleu 10.5  
       steps 4000 bleu 11.5  *
       steps 5000 bleu 10.2
       steps 6000 bleu 11.2
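A minimal sketch of this “really best n” selection, applied to the numbers above. It assumes a higher score is better (as with BLEU) and uses a min-heap so the worst kept checkpoint is always at the front; `update_best` is a hypothetical helper, not an existing joeynmt function.

```python
import heapq
from typing import List, Optional, Tuple


def update_best(best: List[Tuple[float, int]], score: float,
                step: int, n: int) -> Optional[int]:
    """Keep the n highest-scoring checkpoints in a min-heap.

    Returns the step evicted from the best set, or None if nothing
    was evicted.
    """
    if len(best) < n:
        heapq.heappush(best, (score, step))
        return None
    if score > best[0][0]:
        # The new score beats the worst kept one: swap them.
        _, worst_step = heapq.heapreplace(best, (score, step))
        return worst_step
    return None


history = [(1000, 10.0), (2000, 11.0), (3000, 10.5),
           (4000, 11.5), (5000, 10.2), (6000, 11.2)]

best: List[Tuple[float, int]] = []
evicted = []
for step, bleu in history:
    dropped = update_best(best, bleu, step, n=3)
    if dropped is not None:
        evicted.append(dropped)

# Steps of the kept checkpoints, best score first.
best_steps = [step for _, step in sorted(best, reverse=True)]
print(best_steps)  # [4000, 6000, 2000]
```

In this sketch the latest checkpoint would be tracked separately, so a step that does not enter the best set (such as 5000) is simply overwritten by the next latest one.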

Maybe it’s just my personal preference. Misspecification can happen whatever we define for this option. Two separate queues for best and last also sound reasonable to me.