
Trainer always causes divergence after `Trainer.tuner.lr_find`

See original GitHub issue

🐛 Bug

We determine the learning rate using Lightning's LR finder as follows:

import pytorch_lightning as pl

# Trainer configured for tuning; run the LR range test on the training data
trainer = pl.Trainer(max_epochs=16, gpus=1, auto_lr_find=True)
lr_finder = trainer.tuner.lr_find(
    model, train_dataloaders=train_loader,
)
# Fall back to 1e-3 if the finder yields no suggestion
lr = lr_finder.suggestion() or 0.001

Then, we re-instantiate the module with the new learning rate and call trainer.fit() using the same Trainer instance that was used for finding the learning rate.
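
For illustration, a minimal sketch of that follow-up step; the LightningModule class name MyModel and its learning_rate constructor argument are hypothetical stand-ins for our actual module:

# Re-create the module with the suggested learning rate
# (MyModel and its constructor argument are illustrative)
model = MyModel(learning_rate=lr)

# Re-use the very same Trainer instance that ran lr_find
trainer.fit(model, train_dataloaders=train_loader)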

In PyTorch Lightning 1.5.x, this worked without issues. In Lightning 1.6, this always leads to divergence, regardless of the learning rate. Using a new Trainer instance works, but from my understanding, the trainer should still be usable after having called trainer.tuner.lr_find().

To Reproduce

The following Colab reproduces the issue with the Boring Model: https://colab.research.google.com/drive/12BF8m4bCe6ijWDQ1KCEXr5wNAL0TCyhW?usp=sharing

Changing the pytorch_lightning version in the first cell to 1.5.10 solves the issue.

Expected behavior

Calling trainer.fit() after trainer.tuner.lr_find() works just like calling trainer.fit() right after instantiating the trainer.

Environment

  • CUDA:
    • GPU:
      • Tesla K80
    • available: True
    • version: 11.1
  • Packages:
    • numpy: 1.21.5
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0+cu111
    • pytorch-lightning: 1.6.0
    • tqdm: 4.63.0
  • System:
    • OS: Linux
    • architecture: 64bit
    • processor: x86_64
    • python: 3.7.13

cc @akihironitta @borda @rohitgr7

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 4
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
Callidior commented on Aug 17, 2022

@krishnakalyan3 We also observed the issue in our use cases with real data, not just with the Boring Model. The actual problem does not occur during the call to lr_find itself, but when you try to re-use the trainer instance afterwards, e.g., for actually training your model, which would be the logical next step.

Since Lightning 1.6, the LR finder does not reset the trainer properly, so subsequent training still uses the LR scheduler that the LR finder installed to try ever-increasing learning rates. This results in quick divergence.
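
One way to observe this (a diagnostic sketch, not part of the original report) is to attach Lightning's LearningRateMonitor callback to the trainer that is re-used after lr_find; on the affected versions, the logged learning rate keeps growing during fit instead of staying at the suggested value. This assumes the trainer's default logger is enabled, which LearningRateMonitor requires:

from pytorch_lightning.callbacks import LearningRateMonitor

# Same trainer as in the report, plus an LR monitor that logs the
# learning rate actually applied by the optimizer at every step
trainer = pl.Trainer(max_epochs=16, gpus=1, auto_lr_find=True,
                     callbacks=[LearningRateMonitor(logging_interval="step")])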

A simple workaround is to create a new trainer instance. But I assume it should be possible to re-use the same trainer. Otherwise, it should be documented that the trainer is broken after lr_find().
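
A minimal sketch of that workaround, continuing from the hypothetical snippet above (MyModel is still an illustrative stand-in): run lr_find on a throwaway trainer and do the actual training with a freshly created one.

# Trainer used only for the LR search
tune_trainer = pl.Trainer(max_epochs=16, gpus=1, auto_lr_find=True)
lr_finder = tune_trainer.tuner.lr_find(model, train_dataloaders=train_loader)
lr = lr_finder.suggestion() or 0.001

# Fresh trainer for the actual training run
fit_trainer = pl.Trainer(max_epochs=16, gpus=1)
fit_trainer.fit(MyModel(learning_rate=lr), train_dataloaders=train_loader)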

I verified that the fix in #14113 resolves this issue. I also reproduced it on MNIST, as requested: https://colab.research.google.com/drive/1rPPOh_RjGEw9D_V7k57OhmbhtxUjmZ_t?usp=sharing
Note that the loss becomes NaN already after the first epoch of training.

[Screenshots: loss curves from the MNIST reproduction showing the loss becoming NaN after the first epoch]

0 reactions
krishnakalyan3 commented on Aug 17, 2022

Thanks, this makes sense.

