Naming convention for (PyTorch) checkpoints broken?
🚀 Feature request
In previous versions of the library and example scripts, checkpoints were saved with a naming convention that included checkpoint in the file name. Subsequent jobs could look in the output directory and check whether any checkpoint was available: if one was found, they loaded the checkpoint and its corresponding config and continued training from where it left off; if not, they fell back to model_name_or_path.
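For reference, here is a minimal sketch of that workflow written against a recent transformers release, where trainer_utils exposes a get_last_checkpoint helper; the output directory and model name are placeholders, and dataset/training details are omitted:

```python
import os

from transformers import AutoModelForSequenceClassification, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

model_name_or_path = "bert-base-uncased"                  # placeholder starting model
training_args = TrainingArguments(output_dir="outputs")   # placeholder output dir

# Look for a checkpoint-* directory left behind by a previous job in the pipeline.
last_checkpoint = None
if os.path.isdir(training_args.output_dir):
    last_checkpoint = get_last_checkpoint(training_args.output_dir)

# Resume from that checkpoint if one exists, otherwise start from the pretrained model.
model = AutoModelForSequenceClassification.from_pretrained(
    last_checkpoint if last_checkpoint is not None else model_name_or_path
)

# Build the Trainer as usual (train_dataset etc. omitted here) and resume training:
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train(resume_from_checkpoint=last_checkpoint)
```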
From what I can tell, this convention has broken. When using the library's utilities for PyTorch models, the model is saved as pytorch_model.bin (WEIGHTS_NAME in file_utils.py), but when looking for a checkpoint to load, PREFIX_CHECKPOINT_DIR = "checkpoint" from trainer_utils.py is used. The names don't match, so training starts from scratch.
One (local) way to fix this is to reimplement the checkpoint search instead of using the one in the library.
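For what it's worth, a hand-rolled version of that search might look like the sketch below; it reuses the library constant mentioned above and mirrors what newer releases expose as get_last_checkpoint (the function name is mine):

```python
import os
import re

from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR  # "checkpoint"


def find_last_checkpoint(output_dir):
    """Return the newest checkpoint-<step> directory in output_dir, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(rf"^{PREFIX_CHECKPOINT_DIR}-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path


# Fall back to the pretrained model when no checkpoint is found:
# resume_path = find_last_checkpoint("outputs") or model_name_or_path
```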
Is there any other option that allows a pipeline of jobs without using different scripts (e.g., one script that loads the original pretrained BERT model, and all subsequent runs using a different script that points model_name_or_path to the local path where pytorch_model.bin is saved)?
I guess the feature request is to bring this behavior back. One way to do it is to use command-line args for checkpoint names instead of hardcoded naming in the files.
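As a sketch of that option: an explicit flag (the name below is hypothetical) could be passed to the script and forwarded to Trainer.train, which accepts a checkpoint path via resume_from_checkpoint in recent releases:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--resume_checkpoint_dir",  # hypothetical flag name
    default=None,
    help="Checkpoint directory to resume from; start fresh when omitted.",
)
args, _ = parser.parse_known_args()

# Forward the explicit path to the Trainer instead of relying on hardcoded names:
# trainer.train(resume_from_checkpoint=args.resume_checkpoint_dir)
```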
Motivation
Cascading/pipelined training jobs: one job starts and takes a checkpoint, and the next one picks up from the last checkpoint. The same script is used for both the first job and any intermediate job in the pipeline.
Top GitHub Comments
Got it, thanks for clarifying the terminology!
PS: It so happens that I needed a checkpoint to be saved at the end of training; now I understand how that's done.
I think you are confusing the final saved model with checkpoints: the Trainer saves a checkpoint every save_steps (unless you are using a different strategy) and …
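To make the terminology concrete, here is a small sketch against a recent transformers release (save_strategy may not exist in older versions; the values are placeholders): checkpoint-<step> directories are written during training according to the saving strategy, while the final pytorch_model.bin comes from an explicit save_model call.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    save_strategy="steps",  # write a checkpoint-<step> directory during training...
    save_steps=500,         # ...every 500 optimization steps
)

# The final model (pytorch_model.bin + config.json) is saved separately, e.g.:
# trainer.save_model(training_args.output_dir)
```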