
Why is trainer.current_epoch not used for the model checkpoint callback in PyTorch Lightning?

  1. In the model-saving callbacks for PyTorch Lightning, why is step=trainer.global_step used instead of step=trainer.current_epoch? Using current_epoch seems more intuitive, since the saved checkpoints for each trial would then be named after the epoch of that trial and not the global 'step'. (See the sketch below.)
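
For context, here is a minimal sketch of the callback internals being asked about, reconstructed from the snippet quoted later in this thread. The class and method names are assumptions made for illustration; the real callback lives in ray.tune.integration.pytorch_lightning and may differ between Ray versions.

import os
from ray import tune

# Hypothetical reconstruction of the Tune checkpoint callback internals.
class _TuneCheckpointCallback:
    def __init__(self, filename="checkpoint"):
        self._filename = filename

    def _handle(self, trainer, pl_module):
        # The step passed to tune.checkpoint_dir becomes part of the
        # checkpoint directory name, so global_step gives every saved
        # checkpoint a unique, monotonically increasing directory.
        with tune.checkpoint_dir(step=trainer.global_step) as checkpoint_dir:
            trainer.save_checkpoint(
                os.path.join(checkpoint_dir, self._filename))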

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
amogkam commented on Feb 26, 2021

Ah, I see. Yes, using current_epoch as the directory name is fine as long as there is only one checkpoint per epoch. Your current setup looks good.

0 reactions
turmeric-blend commented on Feb 26, 2021

I have my callback set up like so: callback = TuneReportCheckpointCallback(..., on="validation_end"). Internally, in the callback method, I changed trainer.global_step to trainer.current_epoch to make my checkpoint files easier to read:

with tune.checkpoint_dir(step=trainer.current_epoch) as checkpoint_dir:  # was: step=trainer.global_step
    trainer.save_checkpoint(
        os.path.join(checkpoint_dir, self._filename))
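
(My understanding, not verified against the Ray source: the step value passed to tune.checkpoint_dir is used to name the checkpoint directory, so two saves reported with the same step would target the same directory. That is presumably why a value that is unique per save, like global_step, was chosen as the default.)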

I was afraid I might break something (or worse, introduce a silent saving bug, e.g. the latest best checkpoint not being saved because a checkpoint with the same name already exists, like you mentioned), since I am not so familiar with Ray Tune.

My guess was that if my training/validation setup never runs into the case where multiple checkpoints are made within a single epoch, then I wouldn't have to worry about any checkpointing issues. Would having callback = TuneReportCheckpointCallback(..., on="validation_end") be enough to ensure this (so I can keep trainer.current_epoch)?
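
For completeness, a minimal sketch of such a setup, assuming Lightning's default behaviour of running validation once at the end of each epoch (so on="validation_end" fires exactly once per epoch). The metrics mapping and the specific values here are illustrative assumptions, not taken from the thread.

import pytorch_lightning as pl
from ray.tune.integration.pytorch_lightning import TuneReportCheckpointCallback

# Report and checkpoint after each validation pass. With the Trainer
# settings below, validation runs exactly once per epoch, so at most
# one checkpoint is written per value of trainer.current_epoch.
callback = TuneReportCheckpointCallback(
    metrics={"loss": "val_loss"},  # hypothetical mapping: Tune metric -> logged metric
    filename="checkpoint",
    on="validation_end")

trainer = pl.Trainer(
    max_epochs=10,
    check_val_every_n_epoch=1,  # validate every epoch...
    val_check_interval=1.0,     # ...and only at the end of each epoch
    callbacks=[callback])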
