
[Train] Support custom checkpoint file names

See original GitHub issue

Currently, the file names of Ray Train checkpoints are not customizable; they always follow the format checkpoint_XXX.

Provide a way for the user to specify the name of the checkpoint file that they save.

One possible API is to allow the user to specify the checkpoint file name in train.save_checkpoint().
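
To make the proposal concrete, here is a rough sketch, assuming the Ray 1.x training-function style in which train.save_checkpoint() receives the checkpoint contents as keyword arguments. The checkpoint_name argument mentioned in the comments is hypothetical and does not exist in Ray Train today, so it appears only as a comment.

    from ray import train

    def train_func(config):
        for epoch in range(config.get("num_epochs", 10)):
            # ... run one training epoch here ...

            # Today this always produces a checkpoint named checkpoint_XXX.
            # The proposal: an optional argument that names the file instead,
            # e.g. checkpoint_name=f"epoch_{epoch}.ckpt" (hypothetical, not
            # part of the current API).
            train.save_checkpoint(epoch=epoch)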

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
matthewdeng commented, Dec 7, 2021

Thanks @andrijazz for providing this context, I apologize if my comment came off as trying to restrict users!

The reason I asked is that with the current API, even with the ability to customize individual checkpoint names, there may still be some confusion, since checkpoints are written to the <logdir>/run_<run_id>/checkpoints directory, which can change between runs. Perhaps we need to allow customization of this directory name as well…
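
For reference, the layout being discussed looks roughly like this (a sketch; the run id and the zero-padded checkpoint indices are placeholders):

    <logdir>/
        run_<run_id>/              # a fresh run_<run_id> directory per run
            checkpoints/
                checkpoint_000001
                checkpoint_000002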

0 reactions
andrijazz commented, Dec 16, 2021

Being able to specify a custom path and name for the checkpoints would be great.

Another use case that comes to mind is that a user might want to store checkpoints outside of the Ray-generated folders. For example, wandb creates its own directories and automatically uploads all files stored in them to the cloud after the run finishes. A user might want to store checkpoints inside the wandb directory, because they can then easily browse them in the wandb web app and decide which one to use based on the plots.
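
For illustration, the wandb workflow described above might look roughly like this when checkpoints are saved by hand, outside of Ray (a sketch assuming PyTorch and the wandb SDK; the model, project name, and file name are placeholders):

    import os
    import torch
    import wandb

    model = torch.nn.Linear(4, 2)  # stand-in for a real model

    run = wandb.init(project="my-project")

    # Files written under run.dir are synced to the W&B servers, so the
    # checkpoint shows up in the run's Files tab in the web app.
    checkpoint_path = os.path.join(run.dir, "epoch_10.ckpt")
    torch.save(model.state_dict(), checkpoint_path)

    run.finish()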

Read more comments on GitHub

Top Results From Across the Web

  • Training checkpoints | TensorFlow Core
    TensorFlow matches variables to checkpointed values by traversing a directed graph with named edges, starting from the object being loaded. Edge names typically …
  • Saving Checkpoints during Training - PyKEEN - Read the Docs
    Here we have defined a pipeline that will save training loop checkpoints in the checkpoint file called my_checkpoint.pt every time an epoch finishes …
  • Checkpoints | Data Version Control · DVC
    The checkpoint file, specified with --model 'model.pt', is an output from one checkpoint that becomes a dependency for the next checkpoint. The …
  • A Guide To Using Checkpoints — Ray 2.2.0
    This topic is relevant to trial checkpoints. Tune stores checkpoints on the node where the trials are executed. If you are training on …
  • Checkpointing - Composer - MosaicML
    To customize the filenames of checkpoints inside save_folder, you can set the save_filename argument. By default, checkpoints will be named like …
