ModelCheckpoint's _saved variable and EarlyStopping

I’m using ignite 0.2.1, similarly to the transfer-learning-conv-ai repo by Hugging Face. In these lines (roughly sketched after the list below), you can see that:

  • the checkpoint is being saved for every epoch
  • just the last three saved checkpoints are being retained on disk
  • the last checkpoint (due to _saved[-1]) is being renamed to be the final trained model
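
That setup looks roughly like the following sketch (paraphrased from memory rather than copied verbatim from the repo; `log_dir`, `model` and `trainer` stand in for the repo’s actual objects):

    from ignite.engine import Events
    from ignite.handlers import ModelCheckpoint

    # Save a checkpoint at the end of every epoch, keeping only the last three on disk
    checkpoint_handler = ModelCheckpoint(log_dir, 'checkpoint', save_interval=1, n_saved=3)
    trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler,
                              {'mymodel': getattr(model, 'module', model)})

    # After training, the most recently saved checkpoint (checkpoint_handler._saved[-1])
    # is renamed to be the final trained model.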

In my code, I’m additionally using the EarlyStopping class with a configurable patience like this:

    # Stop training (via `trainer`) when the validation score returned by
    # `early_stopping_score_function` has not improved for `args.patience` validations
    valid_es_handler = EarlyStopping(patience=args.patience, score_function=early_stopping_score_function,
                                     trainer=trainer)
    validator.add_event_handler(Events.COMPLETED, valid_es_handler)

Now what I want to accomplish is this: I want to identify and rename the best (in terms of validation set score) trained model from the window of stored checkpoints.

I think the first change needed is n_saved=args.patience instead of n_saved=3, so that the window of saved checkpoints equals the patience used for early stopping.

Consequently, it looks like I need to provide the same early_stopping_score_function also to ModelCheckpoint using the score_function arg, and that would create a score-based priority queue of saved checkpoints.
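
Concretely, I’m thinking of something like this (just a sketch, untested, with the same placeholder names as above):

    # With a score_function, checkpoints are kept by best score rather than by recency,
    # so save_interval is dropped (the two are mutually exclusive in 0.2.1, as far as I can tell)
    checkpoint_handler = ModelCheckpoint(log_dir, 'checkpoint',
                                         score_function=early_stopping_score_function,
                                         n_saved=args.patience)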

And with those changes, it looks like _saved[-1] would still point to the “best” model checkpoint in the window. Is my understanding of the changes correct?

Also, I haven’t looked at the newer versions of ignite after 0.2.1, but could you please share what the breaking changes are (using the above linked code as an example)? I might consider upgrading to the latest ignite if the changes needed are minimal.

@vfdev-5

The other thing I don’t understand is this: the score function would be called on the engine, but for our use case, that engine should be the validator (for both EarlyStopping and ModelCheckpoint), right?

But this line in the transfer-learning-conv-ai repo:

    trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler,
                              {'mymodel': getattr(model, 'module', model)})  # "getattr" takes care of distributed encapsulation

will end up calling the score function on the trainer engine, if I understand correctly. How do I ensure that the validator is used for the score function in the checkpoint_handler?
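
For instance, would attaching the checkpoint handler to the validator instead be the right way to do it? Something like this (only a sketch of what I have in mind, untested):

    validator.add_event_handler(Events.COMPLETED, checkpoint_handler,
                                {'mymodel': getattr(model, 'module', model)})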

Top GitHub Comments

vfdev-5 commented on Apr 9, 2020

@g-karthik please tell us if @sdesrozis’s solution does not fit.

And with those changes, it looks like _saved[-1] would still point to the “best” model checkpoint in the window. Is my understanding of the changes correct?

There was a bug with that found recently: https://github.com/pytorch/ignite/pull/745. It has since been fixed, and the code is available in the nightly release.

Also, I haven’t looked at the newer versions of ignite after 0.2.1, but could you please share what the breaking changes are (using the above linked code as an example)? I might consider upgrading to the latest ignite if the changes needed are minimal.

Please see the release notes of 0.3.0 and keep us updated if you have other questions 😃

sdesrozis commented on Apr 9, 2020

Thank you for this report +1

I don’t have ignite 0.2.1 in mind, but for checkpointing, please look at the following code:

    # Imports added here for a recent ignite version (not part of the original snippet);
    # `trainer`, `evaluator`, `model`, `output_path`, `n_saved`, `tag` and `metric_name`
    # come from the surrounding function in common.py.
    from ignite.engine import Events
    from ignite.handlers import ModelCheckpoint, global_step_from_engine

    # Use the trainer's global step (e.g. the epoch number) in the checkpoint filenames
    global_step_transform = global_step_from_engine(trainer)

    best_model_handler = ModelCheckpoint(
        dirname=output_path,
        filename_prefix="best",
        n_saved=n_saved,
        global_step_transform=global_step_transform,
        score_name="{}_{}".format(tag, metric_name.lower()),
        score_function=get_default_score_fn(metric_name),
    )

    evaluator.add_event_handler(Events.COMPLETED, best_model_handler, {"model": model})

This snippet is from https://github.com/pytorch/ignite/blob/master/ignite/contrib/engines/common.py, which helps to define handlers.
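
In case it is useful: such a score function is just a callable that takes the engine and returns a number to maximize. A minimal sketch (not necessarily the exact helper in common.py; `metric_name` must be a key of `engine.state.metrics`):

    def get_default_score_fn(metric_name):
        def score_fn(engine):
            # higher is better: the checkpoints with the best scores are kept
            return engine.state.metrics[metric_name]
        return score_fn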

So it’s possible to save with respect to a metric 😃 and the score is suffixed to the name of the checkpoint file.

I hope it helps!

EDIT: OK, you pointed at internal ignite code, so I suppose you have already seen that.

EDIT 2: For the second part of your question, I think the checkpoint handler should be attached to the evaluator (as in the snippet I shared), although I don’t know whether that works with ignite 0.2.1.

REMARK: Maybe we could refactor the HuggingFace code to use a recent version of ignite? Its requirements.txt refers to pytorch-ignite, so I guess 0.3 (see https://github.com/huggingface/transfer-learning-conv-ai/blob/master/requirements.txt).

@vfdev-5 you probably have more input on this.
