
After resuming training, scheduler.step() will not update the optimizer's learning rate

See original GitHub issue

I found a bug: when I resume training from a checkpoint, the learning rate always equals the init_lr I set. After debugging, I found that scheduler.step() does not change the optimizer's learning rate. So I set it manually to work around the bug:

    def on_epoch_start(self) -> None:
        # Manually copy the scheduler's current LR into the optimizer's param group.
        self.optimizers().param_groups[0]['lr'] = self.lr_schedulers().get_lr()[0]

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 12 (2 by maintainers)

Top GitHub Comments

3 reactions
FrankZijlstra commented, Oct 17, 2022

I had the same issue and looked into it a little bit. It turns out that by default self.optimizers() returns optimizers from trainer.strategy._lightning_optimizers, and LightningOptimizer maintains a copy of the param_groups field. The parameters are all stored as references to the actual parameters, but the learning rate is not. This behaviour traces back to load_state_dict of the PyTorch optimizer, which overwrites the param_groups list with the list from the state dict but plugs the 'params' values back in. So at that point the copy of param_groups maintained by LightningOptimizer is no longer kept up to date.
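
To make that mechanism concrete, here is a minimal sketch in plain PyTorch (no Lightning involved; all variable names are illustrative): after load_state_dict() the optimizer's param_groups are rebuilt, so a copy of the list captured before the restore no longer tracks later LR changes.

    import torch

    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.SGD([param], lr=1e-4)

    cached_groups = optimizer.param_groups   # stand-in for the wrapper's cached copy
    checkpoint = optimizer.state_dict()

    optimizer.load_state_dict(checkpoint)    # restore, as happens on resume;
                                             # this rebuilds optimizer.param_groups
    optimizer.param_groups[0]['lr'] = 1e-7   # roughly what scheduler.step() does next

    print(optimizer.param_groups[0]['lr'])   # 1e-07  -> the live optimizer is updated
    print(cached_groups[0]['lr'])            # 0.0001 -> the captured copy is stale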

I think a simple solution would be to have the strategy create/update its _lightning_optimizers after a restore from checkpoint. As a user, you can call self.optimizers(use_pl_optimizer=False).param_groups[0]['lr'] instead to fix the issue for now, though I don’t know if not using the LightningOptimizer wrapper will have side effects when using the various training strategies.

A little example: after a fit() which restored from a checkpoint, it looks like this (with an LR of 1e-4, and a scheduler starting at factor 1e-3):

trainer.optimizers[0].param_groups[0]['lr']
Out[36]: 0.00010000000000000009

trainer.strategy._lightning_optimizers[0].param_groups[0]['lr']
Out[37]: 1.0000000000000001e-07
1 reaction
lanslotttTT commented, Apr 20, 2022

I have checked that the scheduler and the optimizer have different learning rates. The scheduler's learning rate is correct, but the optimizer's learning rate is not updated by the scheduler.

---- Original message ---- | From: Rohit @.> | Date: Apr 20, 2022, 16:27 | To: @.> | Cc: @.@.> | Subject: Re: [PyTorchLightning/pytorch-lightning] After resuming training scheduler.step() will not update optimizer's learning rate (Issue #12812) |

did you check the actual learning rate here?

self.optimizers().param_groups[0]['lr']

since while resuming, the optimizer's state is also restored, which includes the learning rate as well.

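For reference, a minimal hedged sketch of how one might compare the two values from inside a LightningModule, along the lines of the check suggested above (the hook choice is illustrative, and it assumes a single optimizer/scheduler pair):

    def on_train_epoch_start(self) -> None:
        # Assumes a single optimizer and scheduler; both calls return lists otherwise.
        scheduler = self.lr_schedulers()
        optimizer = self.optimizers(use_pl_optimizer=False)
        print('scheduler lr:', scheduler.get_last_lr()[0])
        print('optimizer lr:', optimizer.param_groups[0]['lr'])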

Read more comments on GitHub >
