Trainer always causes divergence after `Trainer.tuner.lr_find`
🐛 Bug
We determine the learning rate using Lightning's LR finder as follows:
```python
import pytorch_lightning as pl

# Trainer with the built-in learning-rate finder enabled
trainer = pl.Trainer(max_epochs=16, gpus=1, auto_lr_find=True)

# Run the LR range test and take the suggested learning rate (with a fallback)
lr_finder = trainer.tuner.lr_find(
    model, train_dataloaders=train_loader,
)
lr = lr_finder.suggestion() or 0.001
```
Then, we re-instantiate the module with the new learning rate and call `trainer.fit()` using the same `Trainer` instance that was used for finding the learning rate.
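Continuing the snippet above, that step looks roughly like this (a minimal sketch; the module class name `LitModel` and its `learning_rate` argument are placeholders, not from the original code):

```python
# Re-create the LightningModule with the suggested learning rate and
# train it with the *same* Trainer instance that ran lr_find.
model = LitModel(learning_rate=lr)  # placeholder class/argument names
trainer.fit(model, train_dataloaders=train_loader)
```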
In PyTorch Lightning 1.5.x, this worked without issues. In Lightning 1.6, it always leads to divergence, regardless of the learning rate. Using a new `Trainer` instance works, but to my understanding the trainer should still be usable after calling `trainer.tuner.lr_find()`.
To Reproduce
The following Colab reproduces the issue with the Boring Model: https://colab.research.google.com/drive/12BF8m4bCe6ijWDQ1KCEXr5wNAL0TCyhW?usp=sharing
Changing the pytorch_lightning version in the first cell to 1.5.10 solves the issue.
Expected behavior
Calling `trainer.fit()` after `trainer.tuner.lr_find()` works just like calling `trainer.fit()` right after instantiating the trainer.
Environment
- CUDA:
  - GPU:
    - Tesla K80
  - available: True
  - version: 11.1
- Packages:
  - numpy: 1.21.5
  - pyTorch_debug: False
  - pyTorch_version: 1.10.0+cu111
  - pytorch-lightning: 1.6.0
  - tqdm: 4.63.0
- System:
  - OS: Linux
  - architecture: 64bit
  - processor: x86_64
  - python: 3.7.13
Comments
@krishnakalyan3 We also observed the issue in our use cases with real data, not just with the Boring Model. The actual problem does not occur on the call to `lr_find()` itself, but when you re-use the trainer instance afterwards, e.g. for actually training your model, which would be the logical next step.

Since Lightning 1.6, the LR finder does not reset the trainer properly, so the trainer keeps using the LR scheduler that the LR finder installed for trying increasing learning rates. This results in quick divergence.
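One way to observe this (a rough diagnostic sketch, not part of the original report; the callback name is made up) is to print the optimizer's learning rate during the subsequent `trainer.fit()` call:

```python
import pytorch_lightning as pl

class LRProbe(pl.Callback):
    """Print the optimizer's current learning rate after every training batch."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        current_lr = trainer.optimizers[0].param_groups[0]["lr"]
        print(f"step {trainer.global_step}: lr={current_lr:.3e}")
```

Passing an instance via `pl.Trainer(callbacks=[LRProbe()])` should show the learning rate increasing step by step after `lr_find()` instead of staying at the configured value, if the leftover scheduler is indeed still active.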
A simple workaround is to create a new trainer instance, but I assume it should be possible to re-use the same trainer. Otherwise, it should be documented that the trainer is broken after `lr_find()`.

I verified that the fix in #14113 resolves this issue. I also reproduced it on MNIST, as requested: https://colab.research.google.com/drive/1rPPOh_RjGEw9D_V7k57OhmbhtxUjmZ_t?usp=sharing Note that the loss already becomes NaN after the first epoch of training.
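For completeness, a minimal sketch of the workaround described above, assuming the same setup as in the issue (the module class name `LitModel` and its `learning_rate` argument are placeholders):

```python
import pytorch_lightning as pl

# 1) Use a throwaway Trainer only for the LR range test
tuner_trainer = pl.Trainer(max_epochs=16, gpus=1, auto_lr_find=True)
lr_finder = tuner_trainer.tuner.lr_find(model, train_dataloaders=train_loader)
lr = lr_finder.suggestion() or 0.001

# 2) Train with a fresh Trainer so no leftover LR-finder state carries over
model = LitModel(learning_rate=lr)  # placeholder class/argument names
fit_trainer = pl.Trainer(max_epochs=16, gpus=1)
fit_trainer.fit(model, train_dataloaders=train_loader)
```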
Thanks, this makes sense.