[BUG] Loss drops to 0 after a few thousand steps when using fp16=True
The model training loss suddenly drops to 0 after a little over 1000 steps. I’ve tried a different dataset as well but got the same behaviour.
Details
I am following the notebook Transformers4Rec/examples/tutorial to train a next-item click prediction model on my own dataset of item sequences.
The params I’ve changed are learning_rate=0.01, fp16=True, per_device_train_batch_size=64, and d_model=16; the rest are as in the notebook.
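Roughly, the overrides look like this (a minimal sketch; the argument names are the standard Hugging Face `TrainingArguments` ones that the tutorial's `T4RecTrainingArguments` builds on, while the import path and `output_dir` value are my assumptions and may differ by version):

```python
# Minimal sketch of the changed settings, assuming the T4RecTrainingArguments
# class from the Transformers4Rec tutorial; the import path may vary by version.
from transformers4rec.config.trainer import T4RecTrainingArguments

training_args = T4RecTrainingArguments(
    output_dir="./tmp_trainer",       # placeholder output path
    learning_rate=0.01,               # changed from the notebook default
    fp16=True,                        # mixed-precision training (the setting in question)
    per_device_train_batch_size=64,   # changed from the notebook default
)
# d_model=16 is set on the transformer/model config in the notebook,
# not on the training arguments.
```

Following are the logs for the first day of data:

```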
{'loss': 14.3249, 'learning_rate': 0.009976389135451803, 'epoch': 0.01}
{'loss': 14.083, 'learning_rate': 0.009964554115628145, 'epoch': 0.01}
{'loss': 13.9319, 'learning_rate': 0.009953074146399196, 'epoch': 0.01}
{'loss': 13.8982, 'learning_rate': 0.009947452511982957, 'epoch': 0.02}
{'loss': 12.6002, 'learning_rate': 0.009938812947511687, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.009926977927688029, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.00991514290786437, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.009903307888040712, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.009891472868217054, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.009879637848393396, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.009867802828569739, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.00985596780874608, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009844132788922422, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009832297769098764, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009820462749275106, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.009808627729451447, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.00979679270962779, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.00978495768980413, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009773122669980473, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009761287650156814, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009749452630333156, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 0.009737617610509498, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 0.00972578259068584, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009713947570862181, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009702112551038523, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009690277531214864, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.009678442511391206, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.009666607491567548, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.00965477247174389, 'epoch': 0.11}
{'loss': 0.0, 'learning_rate': 0.009642937451920231, 'epoch': 0.11}
```
Additionally, I am using the Merlin container nvcr.io/nvidia/merlin/merlin-pytorch-training:22.05 for training.
Any suggestions on what might be the issue here?
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@silpara I am turning this into a bug ticket so that we can follow up on it. Training with fp16=False does seem to work fine.
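For anyone landing here, the workaround from the comment above is simply to disable mixed precision in the training arguments (same hypothetical setup as in the sketch earlier in the issue):

```python
# Workaround sketch: run the training in full fp32 by turning off mixed precision.
training_args = T4RecTrainingArguments(
    output_dir="./tmp_trainer",
    learning_rate=0.01,
    fp16=False,                       # disabling fp16 avoids the loss collapsing to 0
    per_device_train_batch_size=64,
)
```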