Why is trainer.global_step, rather than trainer.current_epoch, used in the model checkpoint callback for PyTorch Lightning?
In the model-saving callback for PyTorch Lightning, why is step=trainer.global_step used instead of step=trainer.current_epoch? Using current_epoch seems more intuitive, since the saved checkpoints for each trial would then be named after that trial's epoch rather than the trainer's global 'step' count.
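For reference, the pattern being questioned looks roughly like the sketch below. This is a minimal, hypothetical callback (the class name and structure are mine, not Ray's actual internals), written against the older Ray Tune function API where tune.checkpoint_dir hands a checkpoint directory back to Tune:

```python
import os

import pytorch_lightning as pl
from ray import tune


class TuneCheckpointOnValidationEnd(pl.Callback):
    """Hypothetical sketch of a Tune checkpoint callback for Lightning."""

    def __init__(self, filename: str = "checkpoint"):
        self._filename = filename

    def on_validation_end(self, trainer, pl_module):
        # Only meaningful inside a Tune trial: tune.checkpoint_dir (older Ray
        # function API) yields a directory that Tune tracks for this trial.
        # Keying the directory on trainer.global_step (total optimizer steps so
        # far) produces one checkpoint dir per step value; keying it on
        # trainer.current_epoch would produce one per epoch index instead.
        with tune.checkpoint_dir(step=trainer.global_step) as checkpoint_dir:
            trainer.save_checkpoint(os.path.join(checkpoint_dir, self._filename))
```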
Ah I see, yes, using current_epoch as the dir name is fine as long as there is only one checkpoint per epoch. Your current setup looks good.

I have my callback set up like so: callback = TuneReportCheckpointCallback(..., on="validation_end"). Internally, in the callback method, I changed trainer.global_step to trainer.current_epoch so that my checkpoint files are easier to read. I was afraid I might break something (or, worse, introduce a saving bug without knowing it, e.g. the latest best checkpoint might not be saved if a checkpoint with the same name already exists, like you mentioned), since I am not that familiar with Ray Tune.
My guess is that if my training/validation setup never runs into the case where multiple checkpoints are written within a single epoch, then I don't have to worry about any checkpointing issues. Would having callback = TuneReportCheckpointCallback(..., on="validation_end") be enough to ensure this (so that I can keep trainer.current_epoch)?
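One way to make that once-per-epoch assumption explicit is to pin validation to exactly once per epoch in the Trainer itself, since on="validation_end" fires whenever a validation loop finishes. A sketch under that assumption (the Trainer arguments and callback signature are standard, but the "val_loss" metric name is just a placeholder for whatever the LightningModule logs):

```python
import pytorch_lightning as pl
from ray.tune.integration.pytorch_lightning import TuneReportCheckpointCallback

trainer = pl.Trainer(
    max_epochs=10,
    check_val_every_n_epoch=1,  # run validation every epoch (Lightning default)
    val_check_interval=1.0,     # only at the end of each training epoch (default),
                                # so "validation_end" fires once per epoch
    callbacks=[
        TuneReportCheckpointCallback(
            metrics={"val_loss": "val_loss"},  # placeholder metric name
            filename="checkpoint",
            on="validation_end",
        )
    ],
)
# trainer.fit(model) would then be called inside the Tune trainable as usual.
```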