
Naming convention for (pytorch) checkpoints broken?

See original GitHub issue

🚀 Feature request

In previous versions of the library and sample files, checkpoints were saved with a naming convention that included checkpoint in the file name. A subsequent job could look in the output directory and check whether any checkpoint was available: if one was found, it would load that checkpoint and the corresponding config and continue training from where it left off; if not, it would fall back to model_name_or_path.

From what I can tell, this convention is now broken. When using the library's utilities, a PyTorch model is saved under the name pytorch_model.bin (WEIGHTS_NAME in file_utils.py), but when looking for a checkpoint to resume from, PREFIX_CHECKPOINT_DIR = "checkpoint" from trainer_utils.py is used. The two names don't match, so training starts from scratch.
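To make the mismatch concrete, here is a simplified, self-contained sketch of the kind of checkpoint search the library performs (modeled on the `get_last_checkpoint` helper in trainer_utils.py; this is an approximation, not the actual implementation). Note that a bare pytorch_model.bin in the output directory is invisible to it — only `checkpoint-<step>` subdirectories are considered:

```python
import os
import re
import tempfile

# Mirrors PREFIX_CHECKPOINT_DIR = "checkpoint" from trainer_utils.py;
# the regex matches directories such as checkpoint-500 or checkpoint-1000.
PREFIX_CHECKPOINT_DIR = "checkpoint"
_re_checkpoint = re.compile(rf"^{PREFIX_CHECKPOINT_DIR}-(\d+)$")

def get_last_checkpoint(folder):
    """Return the path of the highest-numbered checkpoint-* subdirectory, or None."""
    checkpoints = [
        path for path in os.listdir(folder)
        if _re_checkpoint.match(path) and os.path.isdir(os.path.join(folder, path))
    ]
    if not checkpoints:
        return None
    latest = max(checkpoints, key=lambda p: int(_re_checkpoint.match(p).group(1)))
    return os.path.join(folder, latest)

# Demo: a bare weights file saved under WEIGHTS_NAME is NOT picked up.
with tempfile.TemporaryDirectory() as out:
    open(os.path.join(out, "pytorch_model.bin"), "w").close()
    assert get_last_checkpoint(out) is None  # the mismatch: weights alone are ignored
    os.makedirs(os.path.join(out, "checkpoint-500"))
    os.makedirs(os.path.join(out, "checkpoint-1000"))
    assert get_last_checkpoint(out).endswith("checkpoint-1000")
```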

One (local) workaround is to write a custom checkpoint search instead of using the one in the library.

Is there any other option that allows a pipeline of jobs to share a single script? (As opposed to, e.g., one script that loads the original pretrained BERT model and a different script for all subsequent runs that points model_name_or_path to the local directory where pytorch_model.bin was saved.)

I guess the feature request is to bring this behavior back. One way to do it would be to accept checkpoint names as command-line arguments instead of hardcoding them in the source files.
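The workaround mentioned above can be sketched as a small resolver that a single pipeline script calls before loading weights (all names here are hypothetical helpers, not library API; the file and prefix names mirror WEIGHTS_NAME and PREFIX_CHECKPOINT_DIR from the issue):

```python
import os
import re
import tempfile

WEIGHTS_NAME = "pytorch_model.bin"            # mirrors WEIGHTS_NAME in file_utils.py
_re_ckpt = re.compile(r"^checkpoint-(\d+)$")  # mirrors PREFIX_CHECKPOINT_DIR

def resolve_model_path(output_dir, model_name_or_path):
    """Decide where this job should load weights from, so the same script
    works as the first job and as any later job in the pipeline."""
    if os.path.isdir(output_dir):
        ckpts = [d for d in os.listdir(output_dir)
                 if _re_ckpt.match(d) and os.path.isdir(os.path.join(output_dir, d))]
        if ckpts:
            latest = max(ckpts, key=lambda d: int(_re_ckpt.match(d).group(1)))
            return os.path.join(output_dir, latest)       # resume from latest checkpoint
        if os.path.isfile(os.path.join(output_dir, WEIGHTS_NAME)):
            return output_dir                             # a prior job's final save
    return model_name_or_path                             # first job: pretrained model

# Demo of the three cases:
with tempfile.TemporaryDirectory() as out:
    assert resolve_model_path(out, "bert-base-uncased") == "bert-base-uncased"
    open(os.path.join(out, WEIGHTS_NAME), "w").close()
    assert resolve_model_path(out, "bert-base-uncased") == out
    os.makedirs(os.path.join(out, "checkpoint-200"))
    assert resolve_model_path(out, "bert-base-uncased").endswith("checkpoint-200")
```

The returned path can then be passed as model_name_or_path to the usual from_pretrained-style loading, keeping one script for every stage of the pipeline.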

Motivation

Cascading/pipelined training jobs: one job runs and saves a checkpoint; the next picks up from the last checkpoint. The same script is used whether it is the first or an intermediate job in the pipeline.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
ioana-blue commented, Mar 16, 2021

Got it, thanks for clarifying the terminology!

PS: It so happens that I needed a checkpoint to be saved at the end of training, and now I understand how that's done.

0 reactions
sgugger commented, Mar 16, 2021

I think you are confusing:

  • saving checkpoints during training, which the Trainer does automatically every save_steps (unless you are using a different save strategy), and
  • saving the final model, which is done at the end of training.
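The distinction sgugger draws can be illustrated with a toy loop (hypothetical names; the real Trainer is far more involved): periodic checkpoints land in `checkpoint-<step>` directories, while the final save produces the bare weights file, and a resume search only looks at the former.

```python
def run_training(total_steps, save_steps):
    """Toy loop: periodic checkpoints during training vs. one final save."""
    saves = []
    for step in range(1, total_steps + 1):
        if step % save_steps == 0:
            saves.append(f"checkpoint-{step}")  # periodic checkpoint, every save_steps
    saves.append("pytorch_model.bin")           # final model, saved at the end
    return saves

print(run_training(total_steps=1000, save_steps=400))
# → ['checkpoint-400', 'checkpoint-800', 'pytorch_model.bin']
```

A resume-from-checkpoint search matches only the `checkpoint-<step>` entries, which is why a directory containing nothing but pytorch_model.bin is treated as "no checkpoint found".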

