
Replace previous checkpoints while resuming training


CheckpointSaver saves model checkpoints for the key_metric_n_saved best validation scores and for the n_saved most recent models. When training is resumed, the internal filename buffer starts out empty, so a whole new set of checkpoint files is created after each resume. Replacing the previous checkpoints instead would be a convenient feature for resuming training. Thanks.
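
For context, a minimal sketch of the kind of setup being described, assuming MONAI's CheckpointSaver attached to an ignite engine (the network, directory, and counts below are illustrative, not taken from the issue):

    import torch
    from ignite.engine import Engine
    from monai.handlers import CheckpointSaver

    net = torch.nn.Linear(10, 2)
    evaluator = Engine(lambda engine, batch: None)  # dummy validation step

    # Keeps the key_metric_n_saved best-scoring checkpoints plus the
    # n_saved most recent ones. After a restart the handler no longer
    # knows about the files it wrote before, so it starts a new series
    # of files instead of replacing the old ones.
    saver = CheckpointSaver(
        save_dir="./runs",          # illustrative path
        save_dict={"net": net},
        save_key_metric=True,
        key_metric_n_saved=4,
        save_interval=1,
        n_saved=3,
    )
    saver.attach(evaluator)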

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
vfdev-5 commented, Feb 18, 2021

@suprosanna in ignite, Checkpoint has state_dict/load_state_dict methods whose output can be saved in order to restore the “filename buffer” later. We also provide a convenient include_self argument so that, every time Checkpoint saves, it includes its own state_dict in the saved file. Then, when using the stored checkpoint file to resume training, we can pass the checkpoint handler along with the other objects to restore its state. Let us know if it helps. Thanks
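
A minimal sketch of the approach described above, using ignite's Checkpoint handler (the model, directory, and file names here are illustrative, and the resume key should be checked against your ignite version):

    import torch
    from ignite.engine import Engine, Events
    from ignite.handlers import Checkpoint, DiskSaver

    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    trainer = Engine(lambda engine, batch: 0.0)  # dummy training step

    to_save = {"model": model, "optimizer": optimizer, "trainer": trainer}

    # include_self=True makes Checkpoint store its own state_dict
    # (including the buffer of already-saved filenames) inside every
    # checkpoint file it writes.
    handler = Checkpoint(
        to_save,
        DiskSaver("./checkpoints", require_empty=False),  # illustrative dir
        n_saved=2,
        include_self=True,
    )
    trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)

    # On resume: restore both the training objects and the Checkpoint
    # handler itself, so new saves continue the old series instead of
    # creating a fresh set of files. The "checkpointer" key is the one
    # used by include_self, and the path below is hypothetical.
    # ckpt = torch.load("./checkpoints/checkpoint_10.pt")
    # Checkpoint.load_objects(
    #     to_load={**to_save, "checkpointer": handler},
    #     checkpoint=ckpt,
    # )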

0 reactions
Nic-Ma commented, Feb 19, 2021

Sounds good, I will try to enhance it later. Thanks.


Top Results From Across the Web

  • Saving checkpoints and resuming training in tensorflow
  • Advanced Keras — Accurately Resuming a Training Process
  • Resuming Training and Checkpoints in Python TensorFlow ...
  • Training checkpoints | TensorFlow Core
  • Keras: Starting, stopping, and resuming training
