Error when using Adafactor without learning rate
Hi, I get these strange errors when I use Adafactor. This code will result in this (expected) error:
optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=1e-4)
ValueError: Cannot combine manual `lr` and `relative_step=True` options
However, if I do not set a manual learning rate, I get a different error. By the way, this code is recommended in the documentation:
optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
# same for
optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True)
will return this error:
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'
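For completeness, a fixed-lr configuration should avoid both errors, since the relative-step logic is bypassed entirely. A minimal sketch (the lr value is just a placeholder, not tuned):

from transformers.optimization import Adafactor

# With a manual lr, relative_step=True is rejected (the ValueError above),
# and warmup_init requires relative_step, so both stay off here.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,                # placeholder value
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)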
Environment info
- transformers version: 4.5.1
- Platform: Linux
- Python version: 3.7.1
- PyTorch version (GPU?): 1.8.0+cu111 and 1.8.1+cu111
- Tensorflow version (GPU?): -
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
Trainer: @sgugger
This was added by @jsrozner and @stas00 in #10526, so pinging them here.
@oliverguhr, we went with the AdafactorSchedule - please check that it works for you: https://github.com/huggingface/transformers/pull/12123
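If it helps, usage should look roughly like this (a sketch assuming the same keyword arguments as in the original report; see the PR for the exact API):

from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
)
# AdafactorSchedule is a proxy schedule that reads back the lr values
# Adafactor computes internally, so code expecting a scheduler can log them.
lr_scheduler = AdafactorSchedule(optimizer)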