TrainsSaver doesn't respect Checkpoint's n_saved

🐛 Bug description

As the title says, TrainsSaver seems to bypass Checkpoint's n_saved parameter: every checkpoint is saved, and older ones are never overwritten or deleted.

Consider this simple example:

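        from ignite.engine import Events
        from ignite.handlers import Checkpoint, global_step_from_engine
        from ignite.contrib.handlers.trains_logger import TrainsSaver

        task.phases['train'].add_event_handler(
            Events.EPOCH_COMPLETED(every=1),
            Checkpoint(to_save, TrainsSaver(output_uri=output_uri), 'epoch', n_saved=1,
                       global_step_transform=global_step_from_engine(task.phases['train'])))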

The above saves every checkpoint. You end up with:

epoch_checkpoint_1.pt
epoch_checkpoint_2.pt
epoch_checkpoint_3.pt
...

Now, if we do the same with DiskSaver:

        task.phases['train'].add_event_handler(
            Events.EPOCH_COMPLETED(every=1),
            Checkpoint(to_save, DiskSaver(dirname=dirname), 'epoch', n_saved=1,
                       global_step_transform=global_step_from_engine(task.phases['train'])))

We get only:

epoch_checkpoint_3.pt

as expected.

The behaviour is the same when saving only the best models with score_function: TrainsSaver keeps every new best model instead of only the n_saved best ones.
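To make the expected retention behaviour concrete, here is a minimal, library-free sketch of what n_saved is supposed to guarantee. It is not Ignite's implementation: `InMemorySaver` and `RetainingCheckpointer` are hypothetical names invented for illustration, and the filename pattern only mimics the step-based names shown above (score-based filenames are left out for brevity). The point is that the checkpointing logic has to ask the save handler to remove stale files; a saver whose removal does not reach the real storage would keep everything.

```python
from typing import Dict, List, Optional, Tuple


class InMemorySaver:
    """Hypothetical save handler: keeps checkpoints in a dict instead of on disk."""

    def __init__(self) -> None:
        self.files: Dict[str, dict] = {}

    def __call__(self, checkpoint: dict, filename: str) -> None:
        self.files[filename] = checkpoint

    def remove(self, filename: str) -> None:
        self.files.pop(filename, None)


class RetainingCheckpointer:
    """Hypothetical stand-in for a checkpoint handler that keeps at most n_saved files."""

    def __init__(self, saver: InMemorySaver, prefix: str, n_saved: int = 1) -> None:
        if n_saved < 1:
            raise ValueError("n_saved must be at least 1")
        self.saver = saver
        self.prefix = prefix
        self.n_saved = n_saved
        # (priority, filename) pairs, kept sorted with the least valuable first
        self._saved: List[Tuple[float, str]] = []

    def save(self, checkpoint: dict, step: int, score: Optional[float] = None) -> None:
        # Without a score, newer steps win; with a score, higher scores win
        priority = float(step if score is None else score)
        if len(self._saved) == self.n_saved and priority <= self._saved[0][0]:
            return  # not better than the worst retained checkpoint
        filename = f"{self.prefix}_checkpoint_{step}.pt"
        self.saver(checkpoint, filename)
        self._saved.append((priority, filename))
        self._saved.sort()
        while len(self._saved) > self.n_saved:
            _, stale = self._saved.pop(0)
            self.saver.remove(stale)  # the step the report says is effectively missing


saver = InMemorySaver()
checkpointer = RetainingCheckpointer(saver, "epoch", n_saved=1)
for epoch in (1, 2, 3):
    checkpointer.save({"epoch": epoch}, step=epoch)
print(sorted(saver.files))  # ['epoch_checkpoint_3.pt']
```

Passing a `score` to `save` gives the "best models only" variant: a checkpoint that does not beat the worst retained score is skipped, and a better one evicts it.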

Environment

  • PyTorch Version: 1.3.1
  • Ignite Version: 0.4.0.dev20200519 (EDIT: updated to the latest nightly, issue still exists)
  • OS: Linux
  • How you installed Ignite: pip nightly
  • Python version: 3.6
  • Any other relevant information: trains version: 0.14.3

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 34 (17 by maintainers)

Top GitHub Comments

3 reactions
bmartinn commented, May 23, 2020

Thanks @vfdev-5! Kudos for the quick PR! We will add it to TrainsServer (it probably needs a bit of added support in Trains as well). I'll update here once the PR is ready.

3 reactions
vfdev-5 commented, May 22, 2020

I was also wondering whether we could pass more info to save_handler in addition to object_to_save and filename: https://github.com/pytorch/ignite/blob/c012166f93e56f8e9538741f5745a5010983ba38/ignite/handlers/checkpoint.py#L21

For example, we can opt to pass some meta-info about the checkpoint to save:

from abc import ABCMeta, abstractmethod
from typing import Mapping


class BaseSaveHandler(metaclass=ABCMeta):
    """Base class for save handlers"""

    @abstractmethod
    def __call__(self, checkpoint: Mapping, filename: str, metadata=None) -> None:
        pass

and in metadata we can pass the prefix, the name and all the scores that compose the filename.

This certainly requires a minor API change for DiskSaver and other savers. However, we only recently introduced BaseSaveHandler as the base class for savers, so we can still change things now…
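As a rough illustration of why such metadata could help, the sketch below shows a saver that uses it to recognise successive checkpoints as versions of the same logical model. Everything beyond the proposed `__call__` signature is an assumption: `SlotSaver` is a hypothetical class, and the metadata keys `"prefix"` and `"name"` are made up here to match the wording of the comment, not taken from any released API.

```python
from abc import ABCMeta, abstractmethod
from typing import Dict, Mapping, Optional


class BaseSaveHandler(metaclass=ABCMeta):
    """Base class for save handlers, with the proposed optional metadata argument."""

    @abstractmethod
    def __call__(self, checkpoint: Mapping, filename: str,
                 metadata: Optional[Mapping] = None) -> None:
        pass


class SlotSaver(BaseSaveHandler):
    """Hypothetical saver that keeps a single entry per (prefix, name) slot."""

    def __init__(self) -> None:
        self.slots: Dict[str, str] = {}  # slot key -> latest filename stored there

    def __call__(self, checkpoint: Mapping, filename: str,
                 metadata: Optional[Mapping] = None) -> None:
        meta = dict(metadata or {})
        # Assumed keys; without metadata each filename falls into its own slot
        key = "{}_{}".format(meta.get("prefix", ""), meta.get("name", filename))
        self.slots[key] = filename  # overwrite instead of accumulating


slot_saver = SlotSaver()
for epoch in (1, 2, 3):
    slot_saver(
        {"epoch": epoch},
        f"epoch_checkpoint_{epoch}.pt",
        metadata={"prefix": "epoch", "name": "checkpoint"},
    )
print(slot_saver.slots)  # {'epoch_checkpoint': 'epoch_checkpoint_3.pt'}
```

With only a filename to go on, a remote saver cannot easily tell that `epoch_checkpoint_2.pt` supersedes `epoch_checkpoint_1.pt`; a stable key derived from metadata gives it something to overwrite. A single slot corresponds to n_saved=1, and handling larger values would need more bookkeeping than this sketch shows.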
