The training loss (per logging step) drops suddenly after each epoch. Help me, please! Orz
See original GitHub issue
System Info
transformers version: 4.17.0
Python version: 3.7.0
torch version: 1.10.1
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
The official CLIP example script (https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text).
I implemented a custom Dataset for training, but found that the training loss drops suddenly after each epoch. The Dataset overrides the three standard methods (__init__, __getitem__, and __len__), and I couldn't figure out the reason for this behavior.
I verified that the data is shuffled properly and that the learning rate decays smoothly. I would appreciate it if you could spare some time to help me.
The picture is drawn from the values in trainer_state.json.
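For context, a custom Dataset of the kind described above typically looks like the following minimal sketch. The class name `ImageTextDataset` and the `records` field are illustrative assumptions, not taken from the actual issue:

```python
# Hypothetical minimal image-text Dataset overriding the three methods
# mentioned above (__init__, __getitem__, __len__); names are illustrative.
from torch.utils.data import Dataset


class ImageTextDataset(Dataset):
    def __init__(self, records):
        # records: list of (pixel_values, input_ids) pairs, already preprocessed
        self.records = records

    def __len__(self):
        # number of examples; used by the DataLoader to define one epoch
        return len(self.records)

    def __getitem__(self, idx):
        # return one example in the dict format the model's forward() expects
        pixel_values, input_ids = self.records[idx]
        return {"pixel_values": pixel_values, "input_ids": input_ids}
```

Nothing in such a Dataset itself would cause per-epoch loss drops, which is why the discussion below focuses on the training dynamics instead.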
Expected behavior
Figure out the reason.
Issue Analytics
- State:
- Created a year ago
- Comments:19
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I don’t think it’s related to CLIP as I’ve seen this happen with multiple models. Here’s the training loss with OPT-350M. OPT-1.3B and GPT-neo-125M also had this behavior. The larger the model, the larger the loss drops were. Unfortunately, I no longer have the tensorboard logs for those runs. I’ve also seen this to a lesser degree with a large MLP and my own pytorch training loop, so I don’t think it’s an issue with HF transformers.
The model is learning at every step, not just at the start of each epoch. Consider a dataset of three points, a, b, and c, where learning from one data point doesn't improve the predictions for the others (e.g. training on a doesn't improve the predictions on b or c). Let's look at what training a model on this dataset would look like.
For each epoch the data is shuffled, but the loader goes through all the data before repeating any of it. So over 4 epochs, an example training session could look like this:
Notice that during any single epoch, since one data point doesn’t improve the predictions for others, the loss stays the same. But as the model trains, it remembers the correct prediction for each data point, so that when it sees it again (which happens in the next epoch) it will produce a better prediction for that specific data point. So the loss for any given data point is correlated with how many times it has seen that specific data point which increments at the start of each epoch.
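The memorization explanation above can be illustrated with a toy simulation. This is an assumption about the mechanism, not a measurement from the actual run: each example's loss here depends only on how many times the model has already seen that exact example, so the loss is flat within an epoch and steps down at each epoch boundary:

```python
# Toy illustration of the memorization hypothesis: loss for an example
# halves each time that specific example is revisited, and nothing else
# affects it. Values are illustrative, not from any real training run.
import random

examples = ["a", "b", "c"]
seen = {ex: 0 for ex in examples}


def loss(ex):
    # loss depends only on how often this exact example was seen before
    return 1.0 * (0.5 ** seen[ex])


log = []
for epoch in range(4):
    order = examples[:]
    random.shuffle(order)  # shuffled each epoch, no repeats within one epoch
    for ex in order:
        log.append(round(loss(ex), 3))
        seen[ex] += 1

print(log)  # [1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.125, 0.125, 0.125]
```

Regardless of the shuffle order, every loss within an epoch is identical and the drop happens exactly at the epoch boundary, reproducing the staircase shape described above.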
As for why it’s not happening with your BERT model, perhaps the model is too small, you have sufficient data to prevent memorization, or the dataset doesn’t have this property.
I’ll point out again that this is my best guess to why this is happening and I haven’t done any experimentation to confirm that this is the reason. You could try training by sampling your dataset with replacement so that a single data point could appear multiple times in the same epoch. I would expect that the drop in loss at epoch starts wouldn’t be visible, although the memorization would still occur.
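The sampling-with-replacement experiment suggested above could be sketched like this with plain PyTorch. With the HF Trainer you would need to swap in this sampler yourself (how exactly depends on the Trainer version, so this only shows the core DataLoader-level idea):

```python
# Sketch: sample the dataset WITH replacement, so a data point can appear
# multiple times (or not at all) within one "epoch". Under the memorization
# hypothesis, this should smear out the per-epoch loss drops.
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))

# replacement=True draws num_samples indices i.i.d. uniformly, so
# duplicates within a single pass are possible
sampler = RandomSampler(dataset, replacement=True, num_samples=len(dataset))
loader = DataLoader(dataset, sampler=sampler, batch_size=4)

indices = [int(i) for (batch,) in loader for i in batch]
print(indices)  # same length as the dataset, but indices may repeat
```

One pass still covers `len(dataset)` draws, so epoch length and learning-rate schedule are unchanged; only the "see everything exactly once per epoch" property is removed.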
Thanks. The Trainer does reset the loss in the function "_maybe_log_save_evaluate". But I still don't understand this phenomenon, because I get a smooth loss curve when I train BERT with the same Trainer and Dataset setup. Anyway, I'll figure it out myself, thanks a lot!
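For readers puzzled by the loss-reset remark above: the Trainer logs a windowed average, not a running average over all steps. A toy sketch of that bookkeeping (illustrative numbers, not the Trainer's actual code) shows why each logged value reflects only the steps since the previous log:

```python
# Toy sketch of a running loss that is reset at each logging step, in the
# spirit of Trainer's _maybe_log_save_evaluate; numbers are illustrative.
losses = [4.0, 3.0, 2.0, 1.0]  # pretend per-step training losses
logging_steps = 2

running = 0.0
logged = []
for step, l in enumerate(losses, start=1):
    running += l
    if step % logging_steps == 0:
        # average over the window since the last log, then reset
        logged.append(running / logging_steps)
        running = 0.0

print(logged)  # [3.5, 1.5]
```

Because each logged point covers only its own window, the reset alone cannot explain a drop aligned with epoch boundaries; it just means the curve tracks recent steps rather than the whole run.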