Resume training from a Lightning-generated ckpt alone
Description & Motivation
It seems that whenever checkpointing is enabled, a corresponding *.ckpt file is generated, containing everything from the model weights to the logging data and the parameters of the LitModule and LitDataModule.
This sounded great to me, so I naively thought I could resume a training run with just the following:
python mainCLI.py fit --ckpt_path /path/to/my/checkpoint.ckpt
However, this does not work, giving me the following error:
usage: mainCLI.py [-h] [-c CONFIG] [--print_config[=flags]] {fit,validate,test,predict,tune} ...
mainCLI.py: error: Configuration check failed :: Key "fit.data.dataset_dir" is required but not included in config object or its value is None.
This is an error about a missing path to my dataset, which is contained in the *.ckpt file I'm trying to resume from.
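As a sanity check, the parameters really are stored in the checkpoint. A quick way to inspect it, assuming the key names used by recent Lightning versions (treat them as assumptions if your version differs):

```python
import torch

ckpt = torch.load("/path/to/my/checkpoint.ckpt", map_location="cpu")
print(list(ckpt.keys()))
# Typically includes "state_dict", "hyper_parameters" (the LitModule hparams)
# and, if the DataModule called save_hyperparameters(), a
# "datamodule_hyper_parameters" entry where dataset_dir would live.
print(ckpt.get("datamodule_hyper_parameters"))
```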
Perhaps this is intended behavior, or perhaps it is a bug. My guess is that it has simply not been implemented, but I was wondering whether this could be an opportunity to load all the parameters from the ckpt when it contains them.
There might be scenarios that need handling, though. For example, if someone provides both a *.ckpt file and a config file, which one takes priority? The config file could take priority, allowing someone to modify some parameters when resuming the training, for example extending trainer.max_time if the training ended because the trainer reached its maximum time (see the example command after this paragraph). But if parameters can be changed, what should the logger report? If someone changes the learning rate, for instance, which learning rate should be logged? I don't have an answer to that, but these questions could already be asked today, since it is technically already possible to resume from a checkpoint with modified parameters. Currently, no change to the parameters is reported: the original parameters are logged and the new ones are not.
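For instance, extending the time budget while resuming could look something like the following; the --trainer.max_time override and the DD:HH:MM:SS value are illustrative, and the exact syntax depends on the LightningCLI version:

```
python mainCLI.py fit --config /path/to/my/config.yaml \
    --ckpt_path /path/to/my/checkpoint.ckpt \
    --trainer.max_time 00:12:00:00
```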
Pitch
No response
Alternatives
No response
Additional context
No response
EDIT: to address the question above: as per @carmocca's answer, this is not expected to work as outlined above. For this to work, you need to specify the config file using --config /path/to/my/config.yaml, but this is just a placeholder config file: its contents will in fact be replaced by the parameters contained in the checkpoint.
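In other words, the resume command ends up looking something like the following (the paths are the same placeholders as above):

```
python mainCLI.py fit --config /path/to/my/config.yaml --ckpt_path /path/to/my/checkpoint.ckpt
```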
Top GitHub Comments
There are many ways to do that. A simple one that uses the CLI would be (pseudocode-ish) along the lines of the sketch below, or the same without passing the ckpt_path argument to the Trainer at all.
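A minimal sketch of the two variants; MyModel, the random data, and the checkpoint path are placeholders, and the Trainer API is used directly rather than the CLI, so treat the details as illustrative rather than as the maintainer's exact snippets:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def __init__(self, hidden: int = 8):
        super().__init__()
        self.save_hyperparameters()  # stored under "hyper_parameters" in the ckpt
        self.layer = torch.nn.Linear(4, hidden)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


train = DataLoader(TensorDataset(torch.randn(32, 4)), batch_size=8)
ckpt_path = "/path/to/my/checkpoint.ckpt"  # placeholder

# Variant A: restore the weights/hparams and the full training state
# (optimizer, epoch, ...) by resuming through ckpt_path.
model = MyModel.load_from_checkpoint(ckpt_path)
pl.Trainer(max_epochs=2).fit(model, train, ckpt_path=ckpt_path)

# Variant B: only restore the weights/hparams and start a fresh fit,
# without passing the ckpt_path argument to the Trainer at all.
model = MyModel.load_from_checkpoint(ckpt_path)
pl.Trainer(max_epochs=2).fit(model, train)
```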
Yes, you need to override this hook in your DataModule: https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.core.LightningDataModule.html#pytorch_lightning.core.LightningDataModule.load_state_dict
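A hedged sketch of what overriding those DataModule hooks could look like; the dataset_dir attribute mirrors the config key from this issue, everything else is illustrative:

```python
import pytorch_lightning as pl


class MyDataModule(pl.LightningDataModule):
    def __init__(self, dataset_dir: str = ""):
        super().__init__()
        self.dataset_dir = dataset_dir

    # What gets written into the checkpoint for this DataModule.
    def state_dict(self):
        return {"dataset_dir": self.dataset_dir}

    # Called when resuming from a checkpoint: restore what was saved above.
    def load_state_dict(self, state_dict):
        self.dataset_dir = state_dict["dataset_dir"]
```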