
Resume training from Lightning generated ckpt alone

See original GitHub issue

Description & Motivation

It seems that whenever checkpointing is enabled, a corresponding *.ckpt file is generated, containing everything from the model weights to the logging data and the parameters of the LitModule and LitDataModule.

This sounded great to me, so I naively thought I could resume training by running just the following: python mainCLI.py fit --ckpt_path /path/to/my/checkpoint.ckpt

However, this does not work, giving me the following error:

usage: mainCLI.py [-h] [-c CONFIG] [--print_config[=flags]] {fit,validate,test,predict,tune} ...
mainCLI.py: error: Configuration check failed :: Key "fit.data.dataset_dir" is required but not included in config object or its value is None.

This is an error about a missing path to my dataset, even though that path is contained in the *.ckpt file I'm trying to resume from.

Perhaps this is intended behavior, or perhaps it is a bug. I'm guessing this has simply not been implemented, but I was wondering if this would be an opportunity to allow all the parameters to be loaded from the ckpt when it contains them.
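To see why resuming from the checkpoint alone seems plausible, here is a rough sketch of what a Lightning *.ckpt file contains once loaded (in real code you would obtain it via torch.load; the keys shown are the typical Lightning checkpoint keys, but the values below are illustrative stand-ins):

```python
# Illustrative structure of a loaded Lightning checkpoint
# (normally: checkpoint = torch.load("checkpoint.ckpt")).
checkpoint = {
    "epoch": 10,
    "global_step": 5000,
    "state_dict": {},                    # model weights
    "optimizer_states": [],              # optimizer state
    "hyper_parameters": {"learning_rate": 1e-3},               # LitModule hparams
    "datamodule_hyper_parameters": {"dataset_dir": "/data"},   # LitDataModule hparams
}

# The value the CLI complains about is, in principle, right there:
dataset_dir = checkpoint["datamodule_hyper_parameters"]["dataset_dir"]
print(dataset_dir)
```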

There might be scenarios which need to be handled, though. For example, if someone provides both a *.ckpt file and a config file, which one takes priority? The config file could take priority, allowing someone to modify some parameters when resuming training (for example, extending trainer.max_time if the training ended because the trainer reached its maximum time). But if those parameters can be changed, then which parameters are reported by the logger? If someone changes the learning rate, for instance, which learning rate should be logged? I don't have an answer to that, but these questions can already be asked today, since it is technically possible to resume from a checkpoint and modify some parameters. Currently, no change to the parameters is reported: the initial parameters are logged and no new parameters are logged.

Pitch

No response

Alternatives

No response

Additional context

No response

cc @carmocca @mauvilsa

EDIT: to address the above question: as per @carmocca's answer, this is not expected to work as outlined above. For it to work, you need to specify the config file using --config /path/to/my/config.yaml, but this config file is just a placeholder: its contents will in fact be replaced by the parameters contained in the checkpoint.
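Concretely, the working invocation would look like this (mainCLI.py and the paths are the hypothetical ones from the example above; the config file only needs to be a valid config, since its values are overwritten by the checkpoint's):

```shell
python mainCLI.py fit \
    --config /path/to/my/config.yaml \
    --ckpt_path /path/to/my/checkpoint.ckpt
```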

Issue Analytics

  • State: closed
  • Created: 9 months ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
carmocca commented, Dec 13, 2022

There are many ways to do that. A simple one that uses the CLI would be (pseudocode-ish):

import torch
from pytorch_lightning.cli import LightningCLI

class MyCLI(LightningCLI):
    def before_fit(self):
        # Load only the model weights; the checkpoint path is a placeholder.
        # Lightning checkpoints store the weights under the "state_dict" key.
        weights = torch.load("/path/to/checkpoint.ckpt")["state_dict"]
        self.model.load_state_dict(weights)

MyCLI()

or

import torch
from pytorch_lightning.cli import LightningCLI

cli = LightningCLI(run=False)  # parse the config but do not run a subcommand
model = cli.model
weights = torch.load("/path/to/checkpoint.ckpt")["state_dict"]
model.load_state_dict(weights)
cli.trainer.fit(model)

without passing the ckpt_path argument to the Trainer at all.

1 reaction
carmocca commented, Dec 13, 2022

"the end result of the instances is following the content of what is defined in the checkpoint, is this understanding correct?"

Yes!

"does it require a manual intervention?"

Yes, you need to override this hook in your DataModule: https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.core.LightningDataModule.html#pytorch_lightning.core.LightningDataModule.load_state_dict
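The hook pair referenced above can be sketched as follows. This is a schematic stand-in: in real code the class would subclass pytorch_lightning.LightningDataModule, and Lightning itself would call state_dict() at checkpoint time and load_state_dict() on resume; the class name and dataset_dir attribute are hypothetical, and the plain-Python class here just makes the sketch runnable anywhere:

```python
# Schematic version of the DataModule state_dict / load_state_dict hook pair.
# In real code: class MyDataModule(pl.LightningDataModule); Lightning calls
# these hooks when saving and restoring a checkpoint.
class MyDataModule:
    def __init__(self, dataset_dir=None):
        self.dataset_dir = dataset_dir

    def state_dict(self):
        # Called when the checkpoint is saved: persist what resuming needs.
        return {"dataset_dir": self.dataset_dir}

    def load_state_dict(self, state_dict):
        # Called when resuming: restore the persisted values.
        self.dataset_dir = state_dict["dataset_dir"]


saved = MyDataModule("/data/imagenet").state_dict()  # at checkpoint time
resumed = MyDataModule()                             # fresh instance on resume
resumed.load_state_dict(saved)
print(resumed.dataset_dir)  # "/data/imagenet"
```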

