
Feature request: resume training without overwriting model dir

See original GitHub issue

Currently it does not seem to be possible to resume a previous training run where it left off if a model directory already exists.

If overwrite is set to False, https://github.com/joeynmt/joeynmt/blob/cd6974f862922757129fa7d50b5fd842baa996f0/joeynmt/helpers.py#L34 will simply error out.

If overwrite is set to True, it will wipe out the existing model directory.
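For reference, the gist of the current behavior is roughly the following (a paraphrase based on the linked make_model_dir helper and the description above, not a verbatim copy of the joeynmt source):

```python
import os
import shutil


def make_model_dir(model_dir: str, overwrite: bool = False) -> str:
    """Create a fresh model directory (rough sketch of the current behavior)."""
    if os.path.isdir(model_dir):
        if not overwrite:
            # overwrite=False: refuse to touch an existing directory
            raise FileExistsError("Model directory exists and overwriting is disabled.")
        # overwrite=True: wipe everything, including previous checkpoints
        shutil.rmtree(model_dir)
    os.makedirs(model_dir)
    return model_dir
```

In neither case is there a path that keeps the existing checkpoints and continues training from them.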

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
cdleong commented, Feb 8, 2022

Trying to think through the desired behavior if we create a continue flag and how it interacts with the existing overwrite flag (a code sketch of this logic follows the table below).

| directory exists already | continue flag is set to… | overwrite flag is set to… | what do? |
| --- | --- | --- | --- |
| FALSE | FALSE | FALSE | make the dir and train |
| FALSE | FALSE | TRUE | make the dir and train |
| FALSE | TRUE | FALSE | make the dir and train |
| FALSE | TRUE | TRUE | make the dir and train |
| TRUE | FALSE | FALSE | make an error and quit. User didn’t want continue, and didn’t want to overwrite. But let them know those are options. |
| TRUE | FALSE | TRUE | User wanted overwrite, so delete the dir and start fresh. If continue defaults to False, then this could happen by accident if someone reruns with the same config. |
| TRUE | TRUE | FALSE | Continue previous training, don’t overwrite. |
| TRUE | TRUE | TRUE | User has requested both continuing and overwriting, which don’t seem compatible. Why would you want both continue and overwrite? Maybe because they just reran the same command? Ask the user which behavior they want, I guess? |
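Translating that table into code, a minimal sketch of the directory-handling logic might look like this (continue_training is the hypothetical new flag; make_model_dir mirrors the existing helper's name, but this is not the actual joeynmt implementation):

```python
import os
import shutil


def make_model_dir(model_dir: str, overwrite: bool = False,
                   continue_training: bool = False) -> str:
    """Create or reuse the model directory according to the table above.

    continue_training is a hypothetical flag; it does not exist in joeynmt.
    """
    if not os.path.isdir(model_dir):
        # Rows 1-4: directory doesn't exist, so just make it and train.
        os.makedirs(model_dir)
        return model_dir

    # Directory already exists:
    if continue_training and overwrite:
        # Row 8: contradictory request; ask the user to pick one behavior.
        raise ValueError(
            "Both continue_training and overwrite are set; choose one: "
            "resume from the existing dir or wipe it and start fresh.")
    if continue_training:
        # Row 7: keep the directory and resume from its latest checkpoint.
        return model_dir
    if overwrite:
        # Row 6: wipe the old directory and start fresh.
        shutil.rmtree(model_dir)
        os.makedirs(model_dir)
        return model_dir
    # Row 5: neither flag set; refuse, but point the user at the options.
    raise FileExistsError(
        f"{model_dir} exists. Set overwrite to start fresh "
        "or continue_training to resume.")
```

The ambiguous last row is resolved here by raising and letting the user decide, which matches the "ask the user" suggestion in the table.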
0 reactions
cdleong commented, Mar 22, 2022

In related news: https://www.philschmid.de/sagemaker-spot-instance

Someday, having S3 checkpointing would be cool!
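To illustrate the idea, here is a minimal sketch of syncing checkpoints to S3 with boto3 (the bucket name, prefix, and helper functions are hypothetical; nothing like this exists in joeynmt today):

```python
import os

import boto3

# Hypothetical bucket and prefix for illustration only.
BUCKET = "my-training-bucket"
PREFIX = "joeynmt/model_dir"

s3 = boto3.client("s3")


def upload_checkpoint(local_path: str) -> None:
    """Push a local checkpoint to S3 so training can survive spot interruption."""
    s3.upload_file(local_path, BUCKET, f"{PREFIX}/{os.path.basename(local_path)}")


def download_checkpoint(name: str, local_path: str) -> None:
    """Pull a previously saved checkpoint back down before resuming training."""
    s3.download_file(BUCKET, f"{PREFIX}/{name}", local_path)
```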

Read more comments on GitHub >

Top Results From Across the Web

  • Issues · joeynmt/joeynmt
    Feature request: resume training without overwriting model dir enhancement New feature or request. #162 opened on Dec 14, 2021 by cdleong.
  • Keras: Starting, stopping, and resuming training
    Learning how to start, stop, and resume training a deep learning model is a super important skill to master — at some point...
  • Trainer
    The Trainer class provides an API for feature-complete training in PyTorch for most standard use cases. It's used in most of the example...
  • Checkpointing — PyTorch Lightning 1.6.3 documentation
    Checkpointing your training allows you to resume a training process in case ... use a pre-trained model for inference without having to retrain...
  • Create custom training jobs | Vertex AI
    Alternatively, if you have already created a Python training application or custom container image, then skip ahead to the Without autopackaging section. With ...
