Getting iterations in Checkpoint is wrong for `global_step_transform`

There seem to be two bugs in Checkpoint related to global_step_transform when the handler is attached to the validation engine rather than the training engine.

First, global_step_transform looks up the global step based on the event that fired the handler. This causes issues when the handler is not attached to an {EPOCH/ITERATION}_COMPLETED event, e.g. when it is attached to COMPLETED on the validation engine, as the docs suggest.
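
For concreteness, a minimal sketch of the setup described above, assuming the usual names (trainer, evaluator, model, to_save and the save path are illustrative, not taken from the issue):

from ignite.engine import Events
from ignite.handlers import Checkpoint, DiskSaver, global_step_from_engine

to_save = {'model': model}  # 'model', 'trainer' and 'evaluator' assumed to exist

# global_step_from_engine(trainer) returns a callable that resolves the step
# from whichever event fired the handler, not from a fixed trainer attribute.
handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     global_step_transform=global_step_from_engine(trainer))

# Attached to COMPLETED on the validation engine, as the docs suggest, the
# handler is fired with an event name other than ITERATION_COMPLETED, so the
# looked-up step is not the trainer iteration the user expects.
evaluator.add_event_handler(Events.COMPLETED, handler)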

Second, global_step_transform is intended to give the “true” count (iteration, epoch, or whatever it may be). As such, it should be used not only in the filename but also as the priority. Right now, the priority is the iteration count of the engine the handler is attached to, which again does not work for the validation engine.
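
As an aside (a sketch, not something proposed in the issue), the event-name lookup can be sidestepped by passing a plain callable: Checkpoint invokes global_step_transform with (engine, event_name), so the callable can ignore both arguments and read the trainer's counter directly. Per the report, this only fixes the filename suffix; the priority is still taken from the attached engine:

# Hypothetical workaround: always report the trainer's iteration count,
# regardless of which engine/event fired the handler.
workaround_handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                                filename_prefix='ckpt',
                                global_step_transform=lambda engine, event_name: trainer.state.iteration)

evaluator.add_event_handler(Events.COMPLETED, workaround_handler)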

A third point, which isn’t really a bug but more of a usability issue: Checkpoint silently drops a checkpoint if it has already written the same filename before. I think such occurrences are likely user error (or, in my case, framework error, since the iteration count of my validation engine is always the same at COMPLETED). Perhaps a warning log is warranted. Alternatively, if the checkpoint really is the same, writing it again is idempotent, so perhaps this check should be removed entirely.
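
To make the collision concrete, a rough sketch (the loop and loader names are hypothetical) of why the suffix repeats when it is derived from the validation engine itself:

# With no global_step_transform, the suffix comes from the evaluator's own
# iteration counter, which ends at the same value after every validation run.
collision_handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                               filename_prefix='ckpt',
                               n_saved=3)  # keep several, to show the issue is the filename, not n_saved
evaluator.add_event_handler(Events.COMPLETED, collision_handler)

for _ in range(3):
    # ... one epoch of training happens here ...
    evaluator.run(val_loader)  # evaluator.state.iteration ends at len(val_loader) each time
    # COMPLETED fires with the same suffix every time, so each checkpoint
    # after the first is silently dropped rather than overwritten.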

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
amatsukawa commented, Jun 22, 2020

> Thanks for explicit example @amatsukawa !

Thanks for your attention on this!

> So, ideally we would like to have the following code

from ignite.engine import Events
from ignite.handlers import Checkpoint, DiskSaver, global_step_from_engine

handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     global_step_transform=global_step_from_engine(trainer, Events.ITERATION_COMPLETED))

evaluator.add_event_handler(Events.COMPLETED, handler)

> produce the output
>
> [ ckpt_checkpoint_30, ]

Yup, that would be ideal. I think this would require that global_step be used as priority if it exists.

> > I may have had this misconception because TF calls the iteration count global_step.
>
> Yes, you are right about this. And if you are using Tensorboard, the global step notion is more relaxed 😃

Perhaps something like: iteration_from_engine, epoch_from_engine, event_attr_from_engine would be more clear 😃
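
For illustration, a simplified sketch (not ignite's actual implementation) of the behaviour suggested above: when a global_step_transform is given and no score_function is set, the transformed step would drive both the filename suffix and the priority used to rank saved checkpoints. The helper name is hypothetical:

def _suffix_and_priority(engine, event_name, global_step_transform=None, score_function=None):
    """Hypothetical helper: compute the filename suffix and the ranking priority."""
    if global_step_transform is not None:
        # the "true" count, e.g. the trainer iteration
        global_step = global_step_transform(engine, event_name)
    else:
        # fall back to the engine the handler is attached to
        global_step = engine.state.iteration
    # when no score is provided, reuse the global step as priority so the
    # ranking matches the number that ends up in the filename
    priority = score_function(engine) if score_function is not None else global_step
    return global_step, priority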

0 reactions
vfdev-5 commented, Jun 22, 2020

We’ll try to work on it. I have put this issue into the 0.4.1 project kanban for now.

@amatsukawa, let me just say that if you would like to contribute to the project and send an initial PR that we could work out further, please do not hesitate 😃
