Why is trainer.global_step, rather than trainer.current_epoch, used in the model checkpoint callback for PyTorch Lightning?
In the model-saving callback for PyTorch Lightning, why is step=trainer.global_step used instead of step=trainer.current_epoch? Using current_epoch seems more intuitive, since the saved checkpoints for each trial would then be named after that trial's epoch rather than the trainer's global 'step' count.
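For reference, the pattern being questioned looks roughly like the sketch below. This is a minimal, hypothetical callback (the class name and structure are mine, not Ray's actual internals), written against the older Ray Tune function API where tune.checkpoint_dir hands a checkpoint directory back to Tune:

```python
import os

import pytorch_lightning as pl
from ray import tune


class TuneCheckpointOnValidationEnd(pl.Callback):
    """Hypothetical sketch of a Tune checkpoint callback for Lightning."""

    def __init__(self, filename: str = "checkpoint"):
        self._filename = filename

    def on_validation_end(self, trainer, pl_module):
        # Only meaningful inside a Tune trial: tune.checkpoint_dir (older Ray
        # function API) yields a directory that Tune tracks for this trial.
        # Keying the directory on trainer.global_step (total optimizer steps so
        # far) produces one checkpoint dir per step value; keying it on
        # trainer.current_epoch would produce one per epoch index instead.
        with tune.checkpoint_dir(step=trainer.global_step) as checkpoint_dir:
            trainer.save_checkpoint(os.path.join(checkpoint_dir, self._filename))
```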
Ah I see, yes, using current_epoch as the dir name is fine as long as there is only one checkpoint per epoch. Your current setup looks good.

I have my callback set up like so: callback = TuneReportCheckpointCallback(..., on="validation_end"). Internally, in the callback method, I changed trainer.global_step to trainer.current_epoch so that my checkpoint files are easier to read. I was afraid I might break something (or, worse, introduce a saving bug without knowing it, e.g. the latest best checkpoint might not be saved if a checkpoint with the same name already exists, like you mentioned), since I am not that familiar with Ray Tune.
My guess is that if my training/validation setup never runs into the case where multiple checkpoints are written within a single epoch, then I don't have to worry about any checkpointing issues. Would having callback = TuneReportCheckpointCallback(..., on="validation_end") be enough to ensure this (so that I can keep trainer.current_epoch)?
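One way to make that once-per-epoch assumption explicit is to pin validation to exactly once per epoch in the Trainer itself, since on="validation_end" fires whenever a validation loop finishes. A sketch under that assumption (the Trainer arguments and callback signature are standard, but the "val_loss" metric name is just a placeholder for whatever the LightningModule logs):

```python
import pytorch_lightning as pl
from ray.tune.integration.pytorch_lightning import TuneReportCheckpointCallback

trainer = pl.Trainer(
    max_epochs=10,
    check_val_every_n_epoch=1,  # run validation every epoch (Lightning default)
    val_check_interval=1.0,     # only at the end of each training epoch (default),
                                # so "validation_end" fires once per epoch
    callbacks=[
        TuneReportCheckpointCallback(
            metrics={"val_loss": "val_loss"},  # placeholder metric name
            filename="checkpoint",
            on="validation_end",
        )
    ],
)
# trainer.fit(model) would then be called inside the Tune trainable as usual.
```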