Getting iterations in Checkpoint is wrong for `global_step_transform`
There seem to be two bugs in `Checkpoint` related to `global_step_transform` when it’s attached to the valid engine rather than the train engine.
First, `global_step_transform` does a lookup based on the event fired. This causes issues when the handler is not attached to an `{EPOCH/ITERATION}_COMPLETED` event, e.g. when it’s attached to `COMPLETED` on the valid engine as the docs suggest.
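To make the first point concrete, here is a minimal sketch in plain Python. It does not use the library itself: `EVENT_TO_ATTR`, `trainer_state`, `step_by_fired_event` and `step_by_fixed_attr` are hypothetical stand-ins, and the way the lookup fails here (a `KeyError`) is only a property of this sketch, not a claim about the real behavior.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the train engine's state (not the library's State class).
trainer_state = SimpleNamespace(iteration=1200, epoch=3)

# Simplified event -> attribute table, assumed for illustration only.
EVENT_TO_ATTR = {
    "iteration_completed": "iteration",
    "epoch_completed": "epoch",
}


def step_by_fired_event(state):
    """Resolve the step from whichever event triggered the handler."""
    def transform(_engine, event_name):
        return getattr(state, EVENT_TO_ATTR[event_name])
    return transform


def step_by_fixed_attr(state, attr):
    """Always read one fixed attribute, regardless of the event that fired."""
    def transform(_engine, _event_name):
        return getattr(state, attr)
    return transform


# The handler sits on the valid engine's "completed" event, which the table lacks.
fixed = step_by_fixed_attr(trainer_state, "epoch")
print(fixed(None, "completed"))
```

Pinning the attribute up front means the transform no longer depends on where the handler happens to be attached.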
Second, `global_step_transform` is intended to give the “true” count (iteration, epoch, whatever it may be). As such, it should be used not only in the filename but also as the `priority`. Right now, priority is the iteration count of the engine the handler is attached to, which again does not work for the valid engine.
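A rough sketch of the suggested behavior, in plain Python with made-up numbers: `pick_priority` is a hypothetical helper, not part of the library, and it simply prefers the transformed global step when one is available.

```python
def pick_priority(attached_engine_iteration, global_step=None):
    """Prefer the transformed global step; fall back to the attached engine's count."""
    if global_step is not None:
        return global_step
    return attached_engine_iteration


# Hypothetical numbers: the valid engine ends at iteration 50 on every run,
# while the train engine has completed epochs 1, 2 and 3.
valid_iteration = 50
trainer_epochs = [1, 2, 3]

without_transform = [pick_priority(valid_iteration) for _ in trainer_epochs]
with_transform = [pick_priority(valid_iteration, epoch) for epoch in trainer_epochs]

print(without_transform)  # every checkpoint gets the same priority
print(with_transform)     # priorities follow the train engine's progress
```

With identical priorities there is nothing to tell the three checkpoints apart, which is the situation described for the valid engine.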
A third point, which isn’t really a bug but more of a usability concern: `Checkpoint` silently drops checkpoints if it has checkpointed the same filename before. I think such occurrences are likely user error (or in my case, framework error, since the iteration count of my valid engine is always the same at `COMPLETED`). Perhaps a warning log is warranted. Alternatively, if the checkpoint is truly the same, writing it again is idempotent, so perhaps this check should be removed entirely.
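The warning option could look roughly like the following. This is only a sketch under assumptions: `FilenameTracker` and the logger name are hypothetical, and the real handler’s bookkeeping may be organised differently.

```python
import logging

logger = logging.getLogger("checkpoint_sketch")  # hypothetical logger name


class FilenameTracker:
    """Remember saved filenames and warn, rather than stay silent, on a repeat."""

    def __init__(self):
        self._seen = set()

    def should_save(self, filename):
        if filename in self._seen:
            logger.warning(
                "Checkpoint %r was already saved; skipping. "
                "Check that the global step actually advances.",
                filename,
            )
            return False
        self._seen.add(filename)
        return True


tracker = FilenameTracker()
first = tracker.should_save("model_50.pth")   # new filename, gets saved
second = tracker.should_save("model_50.pth")  # repeat, skipped with a warning
```

The alternative from the paragraph above, dropping the check entirely, would amount to `should_save` always returning `True`.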
Issue Analytics
- Created 3 years ago
- Comments: 9 (3 by maintainers)
Top GitHub Comments
Thanks for your attention on this!
Yup, that would be ideal. I think this would require that `global_step` be used as `priority` if it exists.
Perhaps something like: `iteration_from_engine`, `epoch_from_engine`, `event_attr_from_engine` would be more clear 😃
We’ll try to work on it. I put this issue into the 0.4.1 project kanban for instance.
@amatsukawa, just let me say that if you would like to contribute to the project and send an initial PR that we could work out further, please do not hesitate 😃