
Allow `extra_epochs` flag in `Trainer.fit` to control finetuning time

See original GitHub issue

🚀 Feature

Trainer(max_epochs=100).fit(model, train_dl, ckpt_path=ckpt_path, extra_epochs=True) would fine-tune for 100 additional epochs beyond the resumed checkpoint.

Motivation

Fine-tuning for N epochs currently requires knowing the previous number of epochs M and setting Trainer(max_epochs=M+N). Googling did not turn up a way to achieve this otherwise.

Pitch

The fine-tuning duration, either as a number of epochs or as training time, should be configurable when resuming from a checkpoint.

Alternatives

Setting a large max_epochs and stopping training manually.

Additional context

It would also be nice to have the same option for max_time. I hope this is already solved and this issue is unnecessary.


cc @justusschock @kaushikb11 @awaelchli @borda @rohitgr7

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
carmocca commented, Jun 14, 2022

You can accomplish this by running:

trainer.fit_loop.max_epochs += 100

before trainer.fit() is called.
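
A minimal sketch of this workaround, assuming a hypothetical LightningModule MyModel, a dataloader train_dl, and a checkpoint path ckpt_path (these names are illustrative, not from the issue):

import pytorch_lightning as pl

model = MyModel()  # hypothetical LightningModule
trainer = pl.Trainer(max_epochs=100)

# If the checkpoint was saved after max_epochs (here 100) epochs, raising
# the fit loop's budget by 100 makes the resumed run train for 100 more
# epochs instead of stopping immediately.
trainer.fit_loop.max_epochs += 100
trainer.fit(model, train_dl, ckpt_path=ckpt_path)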

0 reactions
carmocca commented, Jul 26, 2022

There are 2 potential solutions:

  1. Pre-load the checkpoint manually:

ckpt = torch.load(...)
current_epoch = ckpt["current_epoch"]
trainer = Trainer(max_epochs=current_epoch + N)

An issue with this method is that it loads the full checkpoint just for this change. This relates to #5339 and https://github.com/Lightning-AI/lightning/issues/12712

  2. Extract the state from the checkpoint in on_load_checkpoint and modify the Trainer’s max_epochs. This requires editing the LightningModule hook to do this or creating a Callback just for it; a sketch of the Callback approach is shown below.
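
A hedged sketch of option 2, using a Callback whose on_load_checkpoint hook reads the resumed epoch and extends max_epochs. The ExtendTraining name, the extra_epochs argument, and the "epoch" checkpoint key are assumptions for illustration; the exact key name and whether the hook runs early enough to take effect may depend on the Lightning version.

from pytorch_lightning import Callback, Trainer

class ExtendTraining(Callback):
    # Hypothetical callback: train for `extra_epochs` more epochs after resuming.
    def __init__(self, extra_epochs: int):
        self.extra_epochs = extra_epochs

    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        # Lightning checkpoints typically record the epoch they were saved at;
        # verify the key name ("epoch" here) against your checkpoint contents.
        resumed_epoch = checkpoint.get("epoch", 0)
        trainer.fit_loop.max_epochs = resumed_epoch + self.extra_epochs

trainer = Trainer(callbacks=[ExtendTraining(extra_epochs=100)])
trainer.fit(model, train_dl, ckpt_path=ckpt_path)  # model / train_dl / ckpt_path as in the sketch above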
Read more comments on GitHub >

Top Results From Across the Web

Trainer — PyTorch Lightning 1.8.5.post0 documentation
Running the training, validation and test dataloaders. Calling the Callbacks at the appropriate times. Putting batches and computations on the correct devices.
Read more >
Trainer - Hugging Face
Trainer. The Trainer class provides an API for feature-complete training in PyTorch for most standard use cases. It's used in most of the...
Read more >
Is my training finetuing RoBERTa normal? · Issue #999 - GitHub
I followed the official instruction finetune_custom_classification.md. The ACC of mini-batchs is only 72 after 4.5 epochs and there is...
Read more >
Transformer Model — darts documentation - GitHub Pages
For more information on PyTorch Lightning Trainers check out this link. This function can be called several times to do some extra...
Read more >
Training (tune.Trainable, session.report) — Ray 2.2.0
The Function API allows you to define a custom training function that Tune will run in parallel Ray actor processes, one for each...
Read more >
