[BUG] Loss drops to 0 after a few thousand steps when using fp16=True
The model training loss suddenly drops to 0 after a little over 1000 steps. I’ve tried a different dataset as well but got the same behaviour.
Details
I am following the notebook Transformers4Rec/examples/tutorial to train a next-item click prediction model on my own dataset of item sequences.
The params I’ve changed are learning_rate=0.01, fp16=True, per_device_train_batch_size=64, and d_model=16; the rest are as in the notebook.
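Roughly, the overrides look like this (a minimal sketch; the argument names are the standard Hugging Face `TrainingArguments` ones that the tutorial's `T4RecTrainingArguments` builds on, while the import path and `output_dir` value are my assumptions and may differ by version):

```python
# Minimal sketch of the changed settings, assuming the T4RecTrainingArguments
# class from the Transformers4Rec tutorial; the import path may vary by version.
from transformers4rec.config.trainer import T4RecTrainingArguments

training_args = T4RecTrainingArguments(
    output_dir="./tmp_trainer",       # placeholder output path
    learning_rate=0.01,               # changed from the notebook default
    fp16=True,                        # mixed-precision training (the setting in question)
    per_device_train_batch_size=64,   # changed from the notebook default
)
# d_model=16 is set on the transformer/model config in the notebook,
# not on the training arguments.
```

Following are the logs for the first day of data:

```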
{'loss': 14.3249, 'learning_rate': 0.009976389135451803, 'epoch': 0.01}
{'loss': 14.083, 'learning_rate': 0.009964554115628145, 'epoch': 0.01}
{'loss': 13.9319, 'learning_rate': 0.009953074146399196, 'epoch': 0.01}
{'loss': 13.8982, 'learning_rate': 0.009947452511982957, 'epoch': 0.02}
{'loss': 12.6002, 'learning_rate': 0.009938812947511687, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.009926977927688029, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.00991514290786437, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.009903307888040712, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.009891472868217054, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.009879637848393396, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.009867802828569739, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.00985596780874608, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009844132788922422, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009832297769098764, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.009820462749275106, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.009808627729451447, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.00979679270962779, 'epoch': 0.06}
{'loss': 0.0, 'learning_rate': 0.00978495768980413, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009773122669980473, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009761287650156814, 'epoch': 0.07}
{'loss': 0.0, 'learning_rate': 0.009749452630333156, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 0.009737617610509498, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 0.00972578259068584, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009713947570862181, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009702112551038523, 'epoch': 0.09}
{'loss': 0.0, 'learning_rate': 0.009690277531214864, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.009678442511391206, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.009666607491567548, 'epoch': 0.1}
{'loss': 0.0, 'learning_rate': 0.00965477247174389, 'epoch': 0.11}
{'loss': 0.0, 'learning_rate': 0.009642937451920231, 'epoch': 0.11}
```
Additionally, I am using the Merlin container nvcr.io/nvidia/merlin/merlin-pytorch-training:22.05 for training.
Any suggestions on what might be the issue here?
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@silpara I am turning this into a bug ticket so that we can follow up on it. Training with fp16=False does seem to work fine.
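For anyone landing here, the workaround from the comment above is simply to disable mixed precision in the training arguments (same hypothetical setup as in the sketch earlier in the issue):

```python
# Workaround sketch: run the training in full fp32 by turning off mixed precision.
training_args = T4RecTrainingArguments(
    output_dir="./tmp_trainer",
    learning_rate=0.01,
    fp16=False,                       # disabling fp16 avoids the loss collapsing to 0
    per_device_train_batch_size=64,
)
```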