[Feature request] Save a checkpoint when interrupting the training (ctrl - c)

🚀 Feature Description

Hi,

Sometimes while training, I need my GPU for something else (e.g. to do some work with Whisper), or I need to switch off the computer. So I have to interrupt the training, and sometimes that happens right between two checkpoints (e.g. checkpoints are saved every 10k iterations and training is 7k iterations past the last saved checkpoint). In that case I would lose all the training progress made since the previous checkpoint.

It would therefore be more convenient if a checkpoint were saved when I interrupt the training, so that I can later resume training right from that checkpoint.

Solution

When the training process is interrupted (Ctrl-C), make Coqui save a checkpoint at the current step (as it does when save_step is reached).
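
To illustrate the idea, here is a minimal sketch in plain Python. It is not Coqui's trainer code: `train`, `save_checkpoint`, `train_one_step` and the JSON checkpoint format are hypothetical names invented for this example. It relies only on the fact that Ctrl-C raises `KeyboardInterrupt` in the main thread of a Python program.

```python
import json
import os


def save_checkpoint(state, directory, step):
    """Write the training state to checkpoint_<step>.json and return its path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"checkpoint_{step}.json")
    # Write to a temporary file first, then rename it, so that a second
    # interruption cannot leave a half-written checkpoint under the final name.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)
    return path


def train(total_steps, save_step, directory, train_one_step):
    """Hypothetical training loop that also checkpoints on Ctrl-C."""
    state = {"step": 0}  # a real trainer would also hold model and optimizer state
    try:
        while state["step"] < total_steps:
            train_one_step(state)
            state["step"] += 1
            # Regular checkpoint every save_step iterations.
            if state["step"] % save_step == 0:
                save_checkpoint(state, directory, state["step"])
    except KeyboardInterrupt:
        # Ctrl-C lands here: save the last fully completed step before exiting.
        save_checkpoint(state, directory, state["step"])
    return state
```

With `save_step` at 10k and an interruption at 17k iterations, a loop like this would leave a checkpoint for step 17000 next to the regular one for step 10000, so `save_step` would not have to be lowered.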

Alternative Solutions

I could lower save_step, but then the checkpoints would be too close to each other.

Additional context

Issue Analytics

  • State: open
  • Created 9 months ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
p0p4k commented, Dec 17, 2022

Yes, I am on Discord in the Coqui channel. I’ll ping there.

0 reactions
erogol commented, Dec 17, 2022

@p0p4k found you here… are you on Discord? I want to add you to the contributors’ list if you want.
