
Replace previous checkpoints while resuming training


CheckpointSaver saves model checkpoints for the key_metric_n_saved best validation scores and for the n_saved most recent models. When training is resumed, the internal filename buffer starts out empty, so a whole new set of checkpoint files is created after each resume. Replacing the previous checkpoints instead would be a convenient feature for resuming training. Thanks.
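
For context, a minimal sketch of the kind of setup being described, assuming MONAI's CheckpointSaver attached to an ignite engine (the network, directory, and counts below are illustrative, not taken from the issue):

    import torch
    from ignite.engine import Engine
    from monai.handlers import CheckpointSaver

    net = torch.nn.Linear(10, 2)
    evaluator = Engine(lambda engine, batch: None)  # dummy validation step

    # Keeps the key_metric_n_saved best-scoring checkpoints plus the
    # n_saved most recent ones. After a restart the handler no longer
    # knows about the files it wrote before, so it starts a new series
    # of files instead of replacing the old ones.
    saver = CheckpointSaver(
        save_dir="./runs",          # illustrative path
        save_dict={"net": net},
        save_key_metric=True,
        key_metric_n_saved=4,
        save_interval=1,
        n_saved=3,
    )
    saver.attach(evaluator)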

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
vfdev-5 commented, Feb 18, 2021

@suprosanna in ignite, Checkpoint has state_dict/load_state_dict methods whose output can be saved in order to restore the “filename buffer” later. We also provide a convenient include_self argument so that, every time Checkpoint saves, it includes its own state_dict in the saved file. Then, when using the stored checkpoint file to resume training, we can pass the checkpoint handler along with the other objects to restore its state. Let us know if it helps. Thanks
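
A minimal sketch of the approach described above, using ignite's Checkpoint handler (the model, directory, and file names here are illustrative, and the resume key should be checked against your ignite version):

    import torch
    from ignite.engine import Engine, Events
    from ignite.handlers import Checkpoint, DiskSaver

    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    trainer = Engine(lambda engine, batch: 0.0)  # dummy training step

    to_save = {"model": model, "optimizer": optimizer, "trainer": trainer}

    # include_self=True makes Checkpoint store its own state_dict
    # (including the buffer of already-saved filenames) inside every
    # checkpoint file it writes.
    handler = Checkpoint(
        to_save,
        DiskSaver("./checkpoints", require_empty=False),  # illustrative dir
        n_saved=2,
        include_self=True,
    )
    trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)

    # On resume: restore both the training objects and the Checkpoint
    # handler itself, so new saves continue the old series instead of
    # creating a fresh set of files. The "checkpointer" key is the one
    # used by include_self, and the path below is hypothetical.
    # ckpt = torch.load("./checkpoints/checkpoint_10.pt")
    # Checkpoint.load_objects(
    #     to_load={**to_save, "checkpointer": handler},
    #     checkpoint=ckpt,
    # )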

0 reactions
Nic-Ma commented, Feb 19, 2021

Sounds good, I will try to enhance it later. Thanks.


Top Results From Across the Web

  • Saving checkpoints and resuming training in tensorflow
  • Advanced Keras — Accurately Resuming a Training Process
  • Resuming Training and Checkpoints in Python TensorFlow ...
  • Training checkpoints | TensorFlow Core
  • Keras: Starting, stopping, and resuming training
