Large models don't converge while fine-tuning
See original GitHub issue
I tried to fine-tune the XLM-RoBERTa Large model in a Google Colab environment for 3 epochs with a 1e-5 learning rate, a batch size of 16, 2 gradient accumulation steps, and 120 warmup steps. But the loss didn't converge, and the model gives random predictions after fine-tuning.
I used the sentence pair minimal example as a starting point.
Do you have any idea?
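For reference, here is a minimal sketch of the reported setup (3 epochs, 1e-5 learning rate, batch size 16, 2 gradient accumulation steps, 120 warmup steps) using the Hugging Face Trainer API. The tiny in-memory sentence pairs, the label count, and the output directory are illustrative placeholders, not the actual example the issue refers to.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny placeholder sentence-pair dataset; replace with the real task data.
raw = Dataset.from_dict({
    "sentence1": ["A man is playing a guitar.", "The sky is blue."],
    "sentence2": ["Someone plays an instrument.", "The ocean is dry."],
    "label": [1, 0],
})

def tokenize(batch):
    # Encode the two sentences as a single pair input for XLM-R.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

dataset = raw.map(tokenize, batched=True)

# The configuration reported above.
args = TrainingArguments(
    output_dir="xlmr-large-pairs",      # hypothetical output directory
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    warmup_steps=120,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```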
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 4
- Comments:19 (11 by maintainers)
Top Results From Across the Web

Advanced Techniques for Fine-tuning Transformers
Learn these techniques for fine-tuning BERT, RoBERTa, etc.: Layer-wise Learning Rate Decay (LLRD), warm-up steps, re-initializing layers ... (see the LLRD sketch after these results)

Transfer learning and fine-tuning | TensorFlow Core
It is critical to only do this step after the model with frozen layers has been trained to convergence. If you mix randomly-initialized ...

How To Fit a Bigger Model and Train It Faster - Hugging Face
However, a larger batch size can often result in faster model convergence or better end performance. So ideally we want to tune the ...

Fine-tuning your model | Chan`s Jupyter
C controls the inverse of the regularization strength, and this is what you will tune in this exercise. A large C ...

Models that converged before aren't converging anymore in ...
I can even load the saved model and weights that work. When I train more with the exact same model, the performance actually ...
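The first result above mentions Layer-wise Learning Rate Decay (LLRD), i.e. giving lower layers smaller learning rates than upper layers and the head. A minimal sketch of that idea for XLM-RoBERTa Large with plain PyTorch AdamW parameter groups follows; the attribute names assume the Hugging Face XLMRobertaForSequenceClassification layout, and the base learning rate and decay factor are illustrative assumptions.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-large", num_labels=2)

base_lr = 1e-5   # illustrative base learning rate for the classifier head
decay = 0.9      # illustrative per-layer decay factor

# Layer-wise learning rate decay: walk the encoder from the top layer down,
# shrinking the learning rate at each step; the embeddings end up smallest.
layers = [model.roberta.embeddings] + list(model.roberta.encoder.layer)
param_groups = [{"params": model.classifier.parameters(), "lr": base_lr}]

lr = base_lr
for layer in reversed(layers):   # top encoder layer first, embeddings last
    lr *= decay
    param_groups.append({"params": layer.parameters(), "lr": lr})

optimizer = torch.optim.AdamW(param_groups, lr=base_lr)
```

The resulting optimizer can then be used in a custom training loop, or handed to the Trainer through its `optimizers` argument, in place of the default flat-learning-rate AdamW.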
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
In my case, carefully adjusting the learning rate (with the existing scheduler) along with the number of epochs (on top of the earlier increase in batch size) allowed me to get much better results (beating my results with the base model to date). So there doesn't appear to be anything fundamentally wrong with the pre-trained model or the core model code. It seems you need to do a broader sweep of the parameters in your case (assuming there are no data issues, etc.).
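To illustrate what such a broader sweep could look like, here is a small grid over learning rate and epoch count. `run_finetuning` is a hypothetical stand-in for a full training run that returns a validation metric; it is not code from this thread, and the grid values are illustrative.

```python
import itertools

def run_finetuning(learning_rate: float, num_epochs: int) -> float:
    """Hypothetical placeholder: fine-tune XLM-R Large with these settings
    and return a validation metric (e.g. accuracy)."""
    return 0.0  # replace with a real training + evaluation run

# Grid over the two knobs mentioned in the comment above.
learning_rates = [5e-6, 1e-5, 2e-5]
epoch_counts = [3, 5, 10]

best_score, best_config = float("-inf"), None
for lr, epochs in itertools.product(learning_rates, epoch_counts):
    score = run_finetuning(lr, epochs)
    if score > best_score:
        best_score, best_config = score, (lr, epochs)

print(f"best score {best_score:.4f} with lr={best_config[0]}, epochs={best_config[1]}")
```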
I understand that. But you should add checkpoints within epochs. This won't improve your model, but it will give you more insight into what's really going on, since you will get far more metrics at different points in the process (see the sketch after this comment).
I'm about to re-run XLM-RoBERTa on my binary classifier and will report what worked for me.
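As a sketch of the in-epoch checkpointing suggested above, step-based evaluation, saving, and logging in TrainingArguments produce metrics several times per epoch instead of only at epoch boundaries. The step counts and output directory are illustrative, and the argument names assume a reasonably recent version of transformers.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="xlmr-large-pairs",     # hypothetical output directory
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    warmup_steps=120,
    evaluation_strategy="steps",       # evaluate every eval_steps optimizer steps
    eval_steps=200,                    # illustrative; pick something well below one epoch
    save_strategy="steps",
    save_steps=200,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
```

Passing an `eval_dataset` to the Trainer with these arguments yields loss and metric points every 200 steps, which makes a flat or diverging loss visible mid-epoch.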