Naming convention for (PyTorch) checkpoints broken?
🚀 Feature request
In previous versions of the library and example scripts, checkpoints were saved with a naming convention that included checkpoint in the file name. Subsequent jobs could look in the output directory and check whether any checkpoint was available: if one was found, they loaded the checkpoint and its corresponding config and continued training from where it left off; if not, they fell back to model_name_or_path.
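For reference, here is a minimal sketch of that workflow written against a recent transformers release, where trainer_utils exposes a get_last_checkpoint helper; the output directory and model name are placeholders, and dataset/training details are omitted:

```python
import os

from transformers import AutoModelForSequenceClassification, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

model_name_or_path = "bert-base-uncased"                  # placeholder starting model
training_args = TrainingArguments(output_dir="outputs")   # placeholder output dir

# Look for a checkpoint-* directory left behind by a previous job in the pipeline.
last_checkpoint = None
if os.path.isdir(training_args.output_dir):
    last_checkpoint = get_last_checkpoint(training_args.output_dir)

# Resume from that checkpoint if one exists, otherwise start from the pretrained model.
model = AutoModelForSequenceClassification.from_pretrained(
    last_checkpoint if last_checkpoint is not None else model_name_or_path
)

# Build the Trainer as usual (train_dataset etc. omitted here) and resume training:
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train(resume_from_checkpoint=last_checkpoint)
```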
From what I can tell, this convention has broken. When using the library's utilities for PyTorch models, the model is saved as pytorch_model.bin (WEIGHTS_NAME in file_utils.py), but when looking for a checkpoint to load, PREFIX_CHECKPOINT_DIR = "checkpoint" from trainer_utils.py is used. The names don't match, so training starts from scratch.
One (local) way to fix this is to reimplement the checkpoint search instead of using the one in the library.
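For what it's worth, a hand-rolled version of that search might look like the sketch below; it reuses the library constant mentioned above and mirrors what newer releases expose as get_last_checkpoint (the function name is mine):

```python
import os
import re

from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR  # "checkpoint"


def find_last_checkpoint(output_dir):
    """Return the newest checkpoint-<step> directory in output_dir, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(rf"^{PREFIX_CHECKPOINT_DIR}-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path


# Fall back to the pretrained model when no checkpoint is found:
# resume_path = find_last_checkpoint("outputs") or model_name_or_path
```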
Is there any other option that allows a pipeline of jobs without using different scripts (e.g., one script that loads the original pretrained BERT model, and all subsequent runs using a different script that points model_name_or_path to the local path where pytorch_model.bin is saved)?
I guess the feature request is to bring this behavior back. One way to do it is to use command-line args for checkpoint names instead of hardcoded naming in the files.
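As a sketch of that option: an explicit flag (the name below is hypothetical) could be passed to the script and forwarded to Trainer.train, which accepts a checkpoint path via resume_from_checkpoint in recent releases:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--resume_checkpoint_dir",  # hypothetical flag name
    default=None,
    help="Checkpoint directory to resume from; start fresh when omitted.",
)
args, _ = parser.parse_known_args()

# Forward the explicit path to the Trainer instead of relying on hardcoded names:
# trainer.train(resume_from_checkpoint=args.resume_checkpoint_dir)
```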
Motivation
Cascading/pipelined training jobs: one job starts and takes a checkpoint, and the next one picks up from the last checkpoint. The same script is used for both the first job and any intermediate job in the pipeline.
Top GitHub Comments
Got it, thanks for clarifying the terminology!
PS: It so happens that I needed a checkpoint to be saved at the end of training; now I understand how that's done.
I think you are confusing the final saved model with checkpoints: the Trainer saves a checkpoint every save_steps (unless you are using a different strategy) and …
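To make the terminology concrete, here is a small sketch against a recent transformers release (save_strategy may not exist in older versions; the values are placeholders): checkpoint-<step> directories are written during training according to the saving strategy, while the final pytorch_model.bin comes from an explicit save_model call.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    save_strategy="steps",  # write a checkpoint-<step> directory during training...
    save_steps=500,         # ...every 500 optimization steps
)

# The final model (pytorch_model.bin + config.json) is saved separately, e.g.:
# trainer.save_model(training_args.output_dir)
```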