
After resuming training, scheduler.step() will not update the optimizer's learning rate

See original GitHub issue

I found a bug: when I resume training from a checkpoint, the learning rate always equals the init_lr I set. After debugging, I found that scheduler.step() does not change the optimizer's learning rate. So I set it manually to work around the bug:

    def on_epoch_start(self) -> None:
        # Manually copy the scheduler's current LR into the optimizer's param group.
        self.optimizers().param_groups[0]['lr'] = self.lr_schedulers().get_lr()[0]

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 12 (2 by maintainers)

Top GitHub Comments

3 reactions
FrankZijlstra commented, Oct 17, 2022

I had the same issue and looked into it a little bit. It turns out that by default self.optimizers() returns optimizers from trainer.strategy._lightning_optimizers, and LightningOptimizer maintains a copy of the param_groups field. The parameters are all stored as references to the actual parameters, but the learning rate is not. This behaviour traces back to load_state_dict of the PyTorch optimizer, which overwrites the param_groups list with the list from the state dict but plugs the 'params' values back in. So at that point the copy of param_groups maintained by LightningOptimizer is no longer kept up to date.
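
To make that mechanism concrete, here is a minimal sketch in plain PyTorch (no Lightning involved; all variable names are illustrative): after load_state_dict() the optimizer's param_groups are rebuilt, so a copy of the list captured before the restore no longer tracks later LR changes.

    import torch

    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.SGD([param], lr=1e-4)

    cached_groups = optimizer.param_groups   # stand-in for the wrapper's cached copy
    checkpoint = optimizer.state_dict()

    optimizer.load_state_dict(checkpoint)    # restore, as happens on resume;
                                             # this rebuilds optimizer.param_groups
    optimizer.param_groups[0]['lr'] = 1e-7   # roughly what scheduler.step() does next

    print(optimizer.param_groups[0]['lr'])   # 1e-07  -> the live optimizer is updated
    print(cached_groups[0]['lr'])            # 0.0001 -> the captured copy is stale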

I think a simple solution would be to have the strategy create/update its _lightning_optimizers after a restore from checkpoint. As a user, you can call self.optimizers(use_pl_optimizer=False).param_groups[0]['lr'] instead to fix the issue for now, though I don’t know if not using the LightningOptimizer wrapper will have side effects when using the various training strategies.

A little example: after a fit() which restored from a checkpoint, it looks like this (with an LR of 1e-4, and a scheduler starting at factor 1e-3):

trainer.optimizers[0].param_groups[0]['lr']
Out[36]: 0.00010000000000000009

trainer.strategy._lightning_optimizers[0].param_groups[0]['lr']
Out[37]: 1.0000000000000001e-07
1 reaction
lanslotttTT commented, Apr 20, 2022

I have checked that the scheduler and the optimizer have different learning rates. The scheduler's learning rate is correct, but the optimizer's learning rate is not updated by the scheduler.

---- Original message ---- | From: Rohit @.> | Date: Apr 20, 2022, 16:27 | To: @.> | Cc: @.@.> | Subject: Re: [PyTorchLightning/pytorch-lightning] After resuming training scheduler.step() will not update optimizer's learning rate (Issue #12812) |

did you check the actual learning rate here?

self.optimizers().param_groups[0]['lr']

since while resuming, the optimizer's state is also restored, which includes the learning rate as well.

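For reference, a minimal hedged sketch of how one might compare the two values from inside a LightningModule, along the lines of the check suggested above (the hook choice is illustrative, and it assumes a single optimizer/scheduler pair):

    def on_train_epoch_start(self) -> None:
        # Assumes a single optimizer and scheduler; both calls return lists otherwise.
        scheduler = self.lr_schedulers()
        optimizer = self.optimizers(use_pl_optimizer=False)
        print('scheduler lr:', scheduler.get_last_lr()[0])
        print('optimizer lr:', optimizer.param_groups[0]['lr'])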

Read more comments on GitHub >
