Getting iterations in Checkpoint is wrong for `global_step_transform`

There seem to be two bugs in Checkpoint related to global_step_transform when the handler is attached to the validation engine rather than the training engine.

First, global_step_transform looks up the global step based on the event that fired the handler. This causes issues when the handler is not attached to an {EPOCH/ITERATION}_COMPLETED event, e.g. when it is attached to COMPLETED on the validation engine, as the docs suggest.
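
For concreteness, a minimal sketch of the setup described above, assuming the usual names (trainer, evaluator, model, to_save and the save path are illustrative, not taken from the issue):

from ignite.engine import Events
from ignite.handlers import Checkpoint, DiskSaver, global_step_from_engine

to_save = {'model': model}  # 'model', 'trainer' and 'evaluator' assumed to exist

# global_step_from_engine(trainer) returns a callable that resolves the step
# from whichever event fired the handler, not from a fixed trainer attribute.
handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     global_step_transform=global_step_from_engine(trainer))

# Attached to COMPLETED on the validation engine, as the docs suggest, the
# handler is fired with an event name other than ITERATION_COMPLETED, so the
# looked-up step is not the trainer iteration the user expects.
evaluator.add_event_handler(Events.COMPLETED, handler)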

Second, global_step_transform is intended to give the “true” count (iteration, epoch, or whatever it may be). As such, it should be used not only in the filename but also as the priority. Right now, the priority is the iteration count of the engine the handler is attached to, which again does not work for the validation engine.
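
As an aside (a sketch, not something proposed in the issue), the event-name lookup can be sidestepped by passing a plain callable: Checkpoint invokes global_step_transform with (engine, event_name), so the callable can ignore both arguments and read the trainer's counter directly. Per the report, this only fixes the filename suffix; the priority is still taken from the attached engine:

# Hypothetical workaround: always report the trainer's iteration count,
# regardless of which engine/event fired the handler.
workaround_handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                                filename_prefix='ckpt',
                                global_step_transform=lambda engine, event_name: trainer.state.iteration)

evaluator.add_event_handler(Events.COMPLETED, workaround_handler)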

A third point, which isn’t really a bug but more of a usability issue: Checkpoint silently drops a checkpoint if it has already written the same filename before. I think such occurrences are likely user error (or, in my case, framework error, since the iteration count of my validation engine is always the same at COMPLETED). Perhaps a warning log is warranted. Alternatively, if the checkpoint really is the same, writing it again is idempotent, so perhaps this check should be removed entirely.
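
To make the collision concrete, a rough sketch (the loop and loader names are hypothetical) of why the suffix repeats when it is derived from the validation engine itself:

# With no global_step_transform, the suffix comes from the evaluator's own
# iteration counter, which ends at the same value after every validation run.
collision_handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                               filename_prefix='ckpt',
                               n_saved=3)  # keep several, to show the issue is the filename, not n_saved
evaluator.add_event_handler(Events.COMPLETED, collision_handler)

for _ in range(3):
    # ... one epoch of training happens here ...
    evaluator.run(val_loader)  # evaluator.state.iteration ends at len(val_loader) each time
    # COMPLETED fires with the same suffix every time, so each checkpoint
    # after the first is silently dropped rather than overwritten.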

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
amatsukawa commented, Jun 22, 2020

> Thanks for explicit example @amatsukawa !

Thanks for your attention on this!

> So, ideally we would like to have the following code

from ignite.engine import Events
from ignite.handlers import Checkpoint, DiskSaver, global_step_from_engine

handler = Checkpoint(to_save, DiskSaver('/tmp/models', create_dir=True),
                     filename_prefix='ckpt',
                     global_step_transform=global_step_from_engine(trainer, Events.ITERATION_COMPLETED))

evaluator.add_event_handler(Events.COMPLETED, handler)

> produce the output
>
> [ ckpt_checkpoint_30, ]

Yup, that would be ideal. I think this would require that global_step be used as priority if it exists.

> > I may have had this misconception because TF calls the iteration count global_step.
>
> Yes, you are right about this. And if you are using Tensorboard, the global step notion is more relaxed 😃

Perhaps something like: iteration_from_engine, epoch_from_engine, event_attr_from_engine would be more clear 😃
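
For illustration, a simplified sketch (not ignite's actual implementation) of the behaviour suggested above: when a global_step_transform is given and no score_function is set, the transformed step would drive both the filename suffix and the priority used to rank saved checkpoints. The helper name is hypothetical:

def _suffix_and_priority(engine, event_name, global_step_transform=None, score_function=None):
    """Hypothetical helper: compute the filename suffix and the ranking priority."""
    if global_step_transform is not None:
        # the "true" count, e.g. the trainer iteration
        global_step = global_step_transform(engine, event_name)
    else:
        # fall back to the engine the handler is attached to
        global_step = engine.state.iteration
    # when no score is provided, reuse the global step as priority so the
    # ranking matches the number that ends up in the filename
    priority = score_function(engine) if score_function is not None else global_step
    return global_step, priority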

0 reactions
vfdev-5 commented, Jun 22, 2020

We’ll try to work on it. I have put this issue into the 0.4.1 project kanban for now.

@amatsukawa, let me just say that if you would like to contribute to the project and send an initial PR that we could work out further, please do not hesitate 😃
