TrainsSaver doesn't respect Checkpoint's n_saved

🐛 Bug description

As the title says, it seems that TrainsSaver bypasses the Checkpoint n_saved parameter. That means that all models are saved and never updated / deleted.

Consider this simple example:

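    from ignite.engine import Events
    from ignite.handlers import Checkpoint, global_step_from_engine
    from ignite.contrib.handlers.trains_logger import TrainsSaver
    # Import paths as of ignite 0.4.x; task, to_save and output_uri are defined elsewhere.
    task.phases['train'].add_event_handler(
        Events.EPOCH_COMPLETED(every=1),
        Checkpoint(to_save, TrainsSaver(output_uri=output_uri), 'epoch', n_saved=1,
                   global_step_transform=global_step_from_engine(task.phases['train'])))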

The above saves every checkpoint. You end up with:

epoch_checkpoint_1.pt
epoch_checkpoint_2.pt
epoch_checkpoint_3.pt
...

Now, if we do the same with DiskSaver:

    from ignite.handlers import DiskSaver  # dirname is defined elsewhere

    task.phases['train'].add_event_handler(
        Events.EPOCH_COMPLETED(every=1),
        Checkpoint(to_save, DiskSaver(dirname=dirname), 'epoch', n_saved=1,
                   global_step_transform=global_step_from_engine(task.phases['train'])))

We get only:

epoch_checkpoint_3.pt

as expected.

The behaviour is the same when saving only the best models using score_function: TrainsSaver keeps every best model instead of only the latest n_saved.
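To make the expected n_saved behaviour concrete, here is a minimal, dependency-free sketch. It is a toy model, not ignite's or Trains' actual code: ToyCheckpoint, InMemorySaver and SaveOnlySaver are hypothetical names, and the sketch assumes that the checkpoint handler delegates deletion to the saver through a remove() method. Under that assumption, a saver whose remove() does nothing would reproduce the symptom reported here.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Mapping


class InMemorySaver:
    """Hypothetical saver that supports both saving and removing."""

    def __init__(self) -> None:
        self.files: Dict[str, Any] = {}

    def __call__(self, checkpoint: Mapping, filename: str) -> None:
        self.files[filename] = dict(checkpoint)

    def remove(self, filename: str) -> None:
        self.files.pop(filename, None)


class SaveOnlySaver(InMemorySaver):
    """Hypothetical saver whose remove() is a no-op (mimics the reported symptom)."""

    def remove(self, filename: str) -> None:
        pass


class ToyCheckpoint:
    """Toy retention policy: keep at most n_saved files, evicting the oldest first."""

    def __init__(self, saver: InMemorySaver, prefix: str, n_saved: int = 1) -> None:
        self.saver = saver
        self.prefix = prefix
        self.n_saved = n_saved
        self._saved: Deque[str] = deque()

    def __call__(self, checkpoint: Mapping, step: int) -> None:
        filename = f"{self.prefix}_checkpoint_{step}.pt"
        self.saver(checkpoint, filename)
        self._saved.append(filename)
        # Eviction is delegated to the saver; if remove() does nothing, files pile up.
        while len(self._saved) > self.n_saved:
            self.saver.remove(self._saved.popleft())


def run(saver: InMemorySaver, epochs: int = 3) -> List[str]:
    """Simulate `epochs` EPOCH_COMPLETED events and return the surviving filenames."""
    handler = ToyCheckpoint(saver, "epoch", n_saved=1)
    for epoch in range(1, epochs + 1):
        handler({"epoch": epoch}, epoch)
    return sorted(saver.files)


print(run(InMemorySaver()))  # only epoch_checkpoint_3.pt survives
print(run(SaveOnlySaver()))  # all three checkpoints survive
```

The first call corresponds to the DiskSaver result above, the second to what is observed with TrainsSaver. Whether the real cause is a missing removal step is not established by this sketch.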

Environment

PyTorch Version: 1.3.1
Ignite Version: 0.4.0.dev20200519 (EDIT: update to latest nightly, issue still exists)
OS: Linux
How you installed Ignite: pip nightly
Python version: 3.6
Any other relevant information: trains version: 0.14.3

Top GitHub Comments

bmartinn commented, May 23, 2020

Thanks @vfdev-5! Kudos for the quick PR! We will add it into TrainsServer (this probably needs a bit of support in Trains as well). I’ll update here once the PR is ready.

vfdev-5 commented, May 22, 2020

I was also wondering whether we could pass more info to save_handler in addition to object_to_save and filename: https://github.com/pytorch/ignite/blob/c012166f93e56f8e9538741f5745a5010983ba38/ignite/handlers/checkpoint.py#L21

For example, we can opt to pass some meta-info about the checkpoint to save:

from abc import ABCMeta, abstractmethod
from typing import Mapping, Optional


class BaseSaveHandler(metaclass=ABCMeta):
    """Base class for save handlers"""

    @abstractmethod
    def __call__(self, checkpoint: Mapping, filename: str, metadata: Optional[Mapping] = None) -> None:
        pass

and in metadata we can pass the prefix, the name and all the scores that compose the filename.

This certainly requires a minor API change for DiskSaver and other savers. However, we recently introduced BaseSaveHandler as a base class for savers, so we can still change things now…
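As a rough illustration of the proposal, a saver that accepts the extra metadata argument might look like the sketch below. RecordingSaver and the metadata keys ("prefix", "name", "score_name", "score") are hypothetical, chosen only to mirror the fields mentioned above. This is not the API that ignite ended up with.

```python
from abc import ABCMeta, abstractmethod
from typing import Any, Dict, Mapping, Optional, Tuple


class BaseSaveHandler(metaclass=ABCMeta):
    """Base class for save handlers, with the proposed optional metadata argument."""

    @abstractmethod
    def __call__(self, checkpoint: Mapping, filename: str,
                 metadata: Optional[Mapping] = None) -> None:
        pass


class RecordingSaver(BaseSaveHandler):
    """Hypothetical saver that keeps checkpoints in memory alongside their metadata."""

    def __init__(self) -> None:
        self.saved: Dict[str, Tuple[Dict[str, Any], Dict[str, Any]]] = {}

    def __call__(self, checkpoint: Mapping, filename: str,
                 metadata: Optional[Mapping] = None) -> None:
        # metadata is optional, so callers that pass only two arguments keep working.
        self.saved[filename] = (dict(checkpoint), dict(metadata or {}))


saver = RecordingSaver()

# Call with metadata: the saver sees the score without parsing the filename.
saver(
    {"weights": [0.1, 0.2]},
    "best_model_acc=0.91.pt",
    metadata={"prefix": "best", "name": "model", "score_name": "acc", "score": 0.91},
)

# Call without metadata: falls back to an empty dict.
saver({"weights": [0.3]}, "epoch_checkpoint_1.pt")
```

With the fields passed explicitly, a remote saver could, for example, recognise that two uploads belong to the same prefix and name, rather than inferring that from the filename string.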