Trainer always causes divergence after `Trainer.tuner.lr_find`
🐛 Bug
We determine the learning rate using Lightning's LR finder as follows:
```python
import pytorch_lightning as pl

# Trainer with the built-in learning-rate finder enabled
trainer = pl.Trainer(max_epochs=16, gpus=1, auto_lr_find=True)

# Run the LR range test and take the suggested learning rate (with a fallback)
lr_finder = trainer.tuner.lr_find(
    model, train_dataloaders=train_loader,
)
lr = lr_finder.suggestion() or 0.001
```
Then, we re-instantiate the module with the new learning rate and call `trainer.fit()` using the same `Trainer` instance that was used for finding the learning rate.
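Continuing the snippet above, that step looks roughly like this (a minimal sketch; the module class name `LitModel` and its `learning_rate` argument are placeholders, not from the original code):

```python
# Re-create the LightningModule with the suggested learning rate and
# train it with the *same* Trainer instance that ran lr_find.
model = LitModel(learning_rate=lr)  # placeholder class/argument names
trainer.fit(model, train_dataloaders=train_loader)
```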
In PyTorch Lightning 1.5.x, this worked without issues. In Lightning 1.6, it always leads to divergence, regardless of the learning rate. Using a new `Trainer` instance works, but to my understanding the trainer should still be usable after calling `trainer.tuner.lr_find()`.
To Reproduce
The following Colab reproduces the issue with the Boring Model: https://colab.research.google.com/drive/12BF8m4bCe6ijWDQ1KCEXr5wNAL0TCyhW?usp=sharing
Changing the pytorch_lightning version in the first cell to 1.5.10 solves the issue.
Expected behavior
Calling `trainer.fit()` after `trainer.tuner.lr_find()` works just like calling `trainer.fit()` right after instantiating the trainer.
Environment
- CUDA:
  - GPU:
    - Tesla K80
  - available: True
  - version: 11.1
- Packages:
  - numpy: 1.21.5
  - pyTorch_debug: False
  - pyTorch_version: 1.10.0+cu111
  - pytorch-lightning: 1.6.0
  - tqdm: 4.63.0
- System:
  - OS: Linux
  - architecture: 64bit
  - processor: x86_64
  - python: 3.7.13
Comments
@krishnakalyan3 We also observed the issue in our use cases with real data, not just with the Boring Model. The actual problem does not occur on the call to `lr_find()` itself, but when you re-use the trainer instance afterwards, e.g. for actually training your model, which would be the logical next step.

Since Lightning 1.6, the LR finder does not reset the trainer properly, so the trainer keeps using the LR scheduler that the LR finder installed for trying increasing learning rates. This results in quick divergence.
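One way to observe this (a rough diagnostic sketch, not part of the original report; the callback name is made up) is to print the optimizer's learning rate during the subsequent `trainer.fit()` call:

```python
import pytorch_lightning as pl

class LRProbe(pl.Callback):
    """Print the optimizer's current learning rate after every training batch."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        current_lr = trainer.optimizers[0].param_groups[0]["lr"]
        print(f"step {trainer.global_step}: lr={current_lr:.3e}")
```

Passing an instance via `pl.Trainer(callbacks=[LRProbe()])` should show the learning rate increasing step by step after `lr_find()` instead of staying at the configured value, if the leftover scheduler is indeed still active.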
A simple workaround is to create a new trainer instance, but I assume it should be possible to re-use the same trainer. Otherwise, it should be documented that the trainer is broken after `lr_find()`.

I verified that the fix in #14113 resolves this issue. I also reproduced it on MNIST, as requested: https://colab.research.google.com/drive/1rPPOh_RjGEw9D_V7k57OhmbhtxUjmZ_t?usp=sharing Note that the loss already becomes NaN after the first epoch of training.
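For completeness, a minimal sketch of the workaround described above, assuming the same setup as in the issue (the module class name `LitModel` and its `learning_rate` argument are placeholders):

```python
import pytorch_lightning as pl

# 1) Use a throwaway Trainer only for the LR range test
tuner_trainer = pl.Trainer(max_epochs=16, gpus=1, auto_lr_find=True)
lr_finder = tuner_trainer.tuner.lr_find(model, train_dataloaders=train_loader)
lr = lr_finder.suggestion() or 0.001

# 2) Train with a fresh Trainer so no leftover LR-finder state carries over
model = LitModel(learning_rate=lr)  # placeholder class/argument names
fit_trainer = pl.Trainer(max_epochs=16, gpus=1)
fit_trainer.fit(model, train_dataloaders=train_loader)
```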
Thanks, this makes sense.