TrainsSaver doesn't respect Checkpoint's n_saved
🐛 Bug description
As the title says, it seems that TrainsSaver bypasses the Checkpoint n_saved parameter. That means that all models are saved and never updated / deleted.
Consider this simple example:
task.phases['train'].add_event_handler(
    Events.EPOCH_COMPLETED(every=1),
    Checkpoint(to_save, TrainsSaver(output_uri=output_uri), 'epoch', n_saved=1,
               global_step_transform=global_step_from_engine(task.phases['train'])))
The above saves every checkpoint. You end up with:
epoch_checkpoint_1.pt
epoch_checkpoint_2.pt
epoch_checkpoint_3.pt
...
Now if we do the same with DiskSaver:
task.phases['train'].add_event_handler(
    Events.EPOCH_COMPLETED(every=1),
    Checkpoint(to_save, DiskSaver(dirname=dirname), 'epoch', n_saved=1,
               global_step_transform=global_step_from_engine(task.phases['train'])))
We get only:
epoch_checkpoint_3.pt
as expected.
The same behaviour occurs if we save only the best models using score_function, i.e. TrainsSaver saves every best model.
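To make the difference concrete, here is a minimal pure-Python sketch of the retention logic that n_saved implies. It is not Ignite's actual implementation: RetainingCheckpoint, InMemorySaver and the supports_remove flag are hypothetical names used only for illustration, and the sketch assumes that the checkpoint handler asks the save handler to remove old files.

```python
from collections import deque


class InMemorySaver:
    """Hypothetical save handler that keeps checkpoints in a dict."""

    def __init__(self, supports_remove=True):
        self.files = {}
        self.supports_remove = supports_remove

    def __call__(self, checkpoint, filename):
        self.files[filename] = checkpoint

    def remove(self, filename):
        # A handler that ignores this request reproduces the reported behaviour.
        if self.supports_remove:
            del self.files[filename]


class RetainingCheckpoint:
    """Simplified stand-in for a checkpoint handler with an n_saved limit."""

    def __init__(self, save_handler, prefix, n_saved=1):
        self.save_handler = save_handler
        self.prefix = prefix
        self.n_saved = n_saved
        self._saved = deque()  # filenames, oldest first

    def __call__(self, state, step):
        filename = f"{self.prefix}_checkpoint_{step}.pt"
        self.save_handler(state, filename)
        self._saved.append(filename)
        # Ask the handler to drop the oldest files beyond the limit.
        while len(self._saved) > self.n_saved:
            self.save_handler.remove(self._saved.popleft())


def run(saver, epochs=3):
    """Save one checkpoint per epoch and return the files that remain."""
    handler = RetainingCheckpoint(saver, "epoch", n_saved=1)
    for epoch in range(1, epochs + 1):
        handler({"epoch": epoch}, epoch)
    return sorted(saver.files)


print(run(InMemorySaver(supports_remove=True)))
print(run(InMemorySaver(supports_remove=False)))
```

With supports_remove=True only epoch_checkpoint_3.pt remains, which matches the DiskSaver result above. With supports_remove=False all three files remain, which matches what is reported for TrainsSaver.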
Environment
- PyTorch Version: 1.3.1
- Ignite Version: 0.4.0.dev20200519 (EDIT: update to latest nightly, issue still exists)
- OS: Linux
- How you installed Ignite: pip nightly
- Python version: 3.6
- Any other relevant information: trains version: 0.14.3
Issue Analytics
- State:
- Created 3 years ago
- Comments:34 (17 by maintainers)
Top GitHub Comments
Thanks @vfdev-5! Kudos for the quick PR! We will add it into TrainsServer (it probably needs a bit of support in Trains as well). I'll update here once the PR is ready.
I was also thinking about whether we can pass more info to save_handler in addition to object_to_save and filename: https://github.com/pytorch/ignite/blob/c012166f93e56f8e9538741f5745a5010983ba38/ignite/handlers/checkpoint.py#L21
For example, we can opt to pass some meta-info about the checkpoint to save, and in metadata we can pass prefix, name and all scores which compose the filename.
This certainly requires a minor API change for DiskSaver and other savers. However, we recently introduced BaseSaveHandler as a base class for savers, so we can still change things now…
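One possible shape for that idea, as a rough sketch only: the metadata keyword, the MetadataAwareSaver class, the build_filename helper and the example score values are all assumptions made for illustration, not the API that was eventually adopted.

```python
class MetadataAwareSaver:
    """Hypothetical saver that accepts an optional metadata dict."""

    def __init__(self):
        self.saved = []  # (filename, metadata) pairs, in save order

    def __call__(self, checkpoint, filename, metadata=None):
        # Savers that do not need the extra info can simply ignore it.
        self.saved.append((filename, dict(metadata or {})))


def build_filename(prefix, name, score_name, score, ext="pt"):
    """Hypothetical helper: compose a filename from the metadata parts."""
    parts = [p for p in (prefix, name) if p]
    parts.append(f"{score_name}={score:.4f}")
    return "_".join(parts) + "." + ext


saver = MetadataAwareSaver()
meta = {"prefix": "best", "name": "model", "score_name": "val_acc", "score": 0.9123}
fname = build_filename(meta["prefix"], meta["name"], meta["score_name"], meta["score"])
saver({"weights": [0.1, 0.2]}, fname, metadata=meta)
print(fname)
```

Keeping metadata optional would let existing savers that only expect the object to save and the filename keep working, while a remote saver could use the prefix, name and scores without parsing them back out of the filename.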