ModelCheckpoint's _saved variable and EarlyStopping

I’m using ignite 0.2.1, similarly to the transfer-learning-conv-ai repo by Hugging Face. In these lines (roughly sketched after the list below), you can see that:

  • the checkpoint is being saved for every epoch
  • just the last three saved checkpoints are being retained on disk
  • the last checkpoint (due to _saved[-1]) is being renamed to be the final trained model
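
That setup looks roughly like the following sketch (paraphrased from memory rather than copied verbatim from the repo; `log_dir`, `model` and `trainer` stand in for the repo’s actual objects):

    from ignite.engine import Events
    from ignite.handlers import ModelCheckpoint

    # Save a checkpoint at the end of every epoch, keeping only the last three on disk
    checkpoint_handler = ModelCheckpoint(log_dir, 'checkpoint', save_interval=1, n_saved=3)
    trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler,
                              {'mymodel': getattr(model, 'module', model)})

    # After training, the most recently saved checkpoint (checkpoint_handler._saved[-1])
    # is renamed to be the final trained model.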

In my code, I’m additionally using the EarlyStopping class with a configurable patience like this:

    # Stop training (via `trainer`) when the validation score returned by
    # `early_stopping_score_function` has not improved for `args.patience` validations
    valid_es_handler = EarlyStopping(patience=args.patience, score_function=early_stopping_score_function,
                                     trainer=trainer)
    validator.add_event_handler(Events.COMPLETED, valid_es_handler)

Now what I want to accomplish is this: I want to identify and rename the best (in terms of validation set score) trained model from the window of stored checkpoints.

I think the first change needed is n_saved=args.patience instead of n_saved=3, so that the window of saved checkpoints equals the patience used for early stopping.

Consequently, it looks like I need to provide the same early_stopping_score_function also to ModelCheckpoint using the score_function arg, and that would create a score-based priority queue of saved checkpoints.
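
Concretely, I’m thinking of something like this (just a sketch, untested, with the same placeholder names as above):

    # With a score_function, checkpoints are kept by best score rather than by recency,
    # so save_interval is dropped (the two are mutually exclusive in 0.2.1, as far as I can tell)
    checkpoint_handler = ModelCheckpoint(log_dir, 'checkpoint',
                                         score_function=early_stopping_score_function,
                                         n_saved=args.patience)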

And with those changes, it looks like _saved[-1] would still point to the “best” model checkpoint in the window. Is my understanding of the changes correct?

Also, I haven’t looked at the newer versions of ignite after 0.2.1, but could you please share what the breaking changes are (using the above linked code as an example)? I might consider upgrading to the latest ignite if the changes needed are minimal.

@vfdev-5

The other thing I don’t understand is this: the score function would be called on the engine, but for our use case, that engine should be the validator (for both EarlyStopping and ModelCheckpoint), right?

But this line in the transfer-learning-conv-ai repo:

    trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler,
                              {'mymodel': getattr(model, 'module', model)})  # "getattr" takes care of distributed encapsulation

will end up calling the score function on the trainer engine, if I understand correctly. How do I ensure that the validator is used for the score function in the checkpoint_handler?
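
For instance, would attaching the checkpoint handler to the validator instead be the right way to do it? Something like this (only a sketch of what I have in mind, untested):

    validator.add_event_handler(Events.COMPLETED, checkpoint_handler,
                                {'mymodel': getattr(model, 'module', model)})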

Top GitHub Comments

vfdev-5 commented on Apr 9, 2020

@g-karthik please tell us if @sdesrozis’s solution does not fit.

And with those changes, it looks like _saved[-1] would still point to the “best” model checkpoint in the window. Is my understanding of the changes correct?

There was a bug with that found recently: https://github.com/pytorch/ignite/pull/745. It has since been fixed, and the code is available in the nightly release.

Also, I haven’t looked at the newer versions of ignite after 0.2.1, but could you please share what the breaking changes are (using the above linked code as an example)? I might consider upgrading to the latest ignite if the changes needed are minimal.

Please see the release notes of 0.3.0 and keep us updated if you have other questions 😃

sdesrozis commented on Apr 9, 2020

Thank you for this report +1

I don’t have ignite 0.2.1 in mind, but for checkpointing, please look at the following code:

    # Imports added here for a recent ignite version (not part of the original snippet);
    # `trainer`, `evaluator`, `model`, `output_path`, `n_saved`, `tag` and `metric_name`
    # come from the surrounding function in common.py.
    from ignite.engine import Events
    from ignite.handlers import ModelCheckpoint, global_step_from_engine

    # Use the trainer's global step (e.g. the epoch number) in the checkpoint filenames
    global_step_transform = global_step_from_engine(trainer)

    best_model_handler = ModelCheckpoint(
        dirname=output_path,
        filename_prefix="best",
        n_saved=n_saved,
        global_step_transform=global_step_transform,
        score_name="{}_{}".format(tag, metric_name.lower()),
        score_function=get_default_score_fn(metric_name),
    )

    evaluator.add_event_handler(Events.COMPLETED, best_model_handler, {"model": model})

This snippet is from https://github.com/pytorch/ignite/blob/master/ignite/contrib/engines/common.py, which helps to define handlers.
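
In case it is useful: such a score function is just a callable that takes the engine and returns a number to maximize. A minimal sketch (not necessarily the exact helper in common.py; `metric_name` must be a key of `engine.state.metrics`):

    def get_default_score_fn(metric_name):
        def score_fn(engine):
            # higher is better: the checkpoints with the best scores are kept
            return engine.state.metrics[metric_name]
        return score_fn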

So it’s possible to save with respect to a metric 😃 and the score is suffixed to the name of the checkpoint file.

I hope it helps!

EDIT: OK, you pointed at internal ignite code, so I suppose you have already seen that.

EDIT 2: For the second part of your question, I think the checkpoint handler should be attached to the evaluator (as in the snippet I shared), although I don’t know whether that works with ignite 0.2.1.

REMARK: Maybe we could refactor the HuggingFace code to use a recent version of ignite? Its requirements.txt refers to pytorch-ignite, so I guess 0.3 (see https://github.com/huggingface/transfer-learning-conv-ai/blob/master/requirements.txt).

@vfdev-5 you probably have more input on this.
