
[Train] Support custom checkpoint file names

See original GitHub issue

Currently, the file names of Ray Train checkpoints are not customizable; they always follow the format checkpoint_XXX.

Provide a way for the user to specify the name of the checkpoint file that they save.

One possible API is to allow the user to specify the checkpoint file name in train.save_checkpoint().
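
To make the proposal concrete, here is a rough sketch, assuming the Ray 1.x training-function style in which train.save_checkpoint() receives the checkpoint contents as keyword arguments. The checkpoint_name argument mentioned in the comments is hypothetical and does not exist in Ray Train today, so it appears only as a comment.

    from ray import train

    def train_func(config):
        for epoch in range(config.get("num_epochs", 10)):
            # ... run one training epoch here ...

            # Today this always produces a checkpoint named checkpoint_XXX.
            # The proposal: an optional argument that names the file instead,
            # e.g. checkpoint_name=f"epoch_{epoch}.ckpt" (hypothetical, not
            # part of the current API).
            train.save_checkpoint(epoch=epoch)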

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
matthewdeng commented, Dec 7, 2021

Thanks @andrijazz for providing this context, I apologize if my comment came off as trying to restrict users!

The reason I asked is that with the current API, even with the ability to customize individual checkpoint names, there may still be some confusion, since checkpoints are written to the <logdir>/run_<run_id>/checkpoints directory, which can change between runs. Perhaps we need to allow customization of this directory name as well…
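
For reference, the layout being discussed looks roughly like this (a sketch; the run id and the zero-padded checkpoint indices are placeholders):

    <logdir>/
        run_<run_id>/              # a fresh run_<run_id> directory per run
            checkpoints/
                checkpoint_000001
                checkpoint_000002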

0 reactions
andrijazz commented, Dec 16, 2021

Being able to specify a custom path and name for the checkpoints would be great.

Another use case that comes to mind is that a user might want to store checkpoints outside of the Ray-generated folders. For example, wandb creates its own directories and automatically uploads all files stored in them to the cloud after the run finishes. A user might want to store checkpoints inside the wandb directory, because they can then easily browse them in the wandb web app and decide which one to use based on the plots.
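
For illustration, the wandb workflow described above might look roughly like this when checkpoints are saved by hand, outside of Ray (a sketch assuming PyTorch and the wandb SDK; the model, project name, and file name are placeholders):

    import os
    import torch
    import wandb

    model = torch.nn.Linear(4, 2)  # stand-in for a real model

    run = wandb.init(project="my-project")

    # Files written under run.dir are synced to the W&B servers, so the
    # checkpoint shows up in the run's Files tab in the web app.
    checkpoint_path = os.path.join(run.dir, "epoch_10.ckpt")
    torch.save(model.state_dict(), checkpoint_path)

    run.finish()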

Read more comments on GitHub

Top Results From Across the Web

  • Training checkpoints | TensorFlow Core
    TensorFlow matches variables to checkpointed values by traversing a directed graph with named edges, starting from the object being loaded. Edge names typically …
  • Saving Checkpoints during Training - PyKEEN - Read the Docs
    Here we have defined a pipeline that will save training loop checkpoints in the checkpoint file called my_checkpoint.pt every time an epoch finishes …
  • Checkpoints | Data Version Control · DVC
    The checkpoint file, specified with --model 'model.pt', is an output from one checkpoint that becomes a dependency for the next checkpoint. The …
  • A Guide To Using Checkpoints — Ray 2.2.0
    This topic is relevant to trial checkpoints. Tune stores checkpoints on the node where the trials are executed. If you are training on …
  • Checkpointing - Composer - MosaicML
    To customize the filenames of checkpoints inside save_folder, you can set the save_filename argument. By default, checkpoints will be named like …
