
The training loss (logging steps) drops suddenly after each epoch? Help me, please!


System Info

  • transformers version: 4.17.0
  • Python version: 3.7.0
  • torch version: 1.10.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

CLIP (https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text).

I have implemented a Dataset for training, but I have found that the training loss drops suddenly after each epoch. The Dataset overrides three methods (__init__, __getitem__ and __len__), and I couldn't figure out the reason for this behavior.
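
For reference, here is a minimal sketch of the kind of Dataset described above, assuming a CLIPProcessor-style preprocessor; the class name, arguments, and preprocessing choices are hypothetical placeholders, not the reporter's actual code:

```python
from PIL import Image
from torch.utils.data import Dataset


class ImageTextDataset(Dataset):
    """Hypothetical image-text pair dataset for contrastive (CLIP-style) training."""

    def __init__(self, image_paths, captions, processor):
        # image_paths: list of image file paths, captions: parallel list of strings,
        # processor: e.g. a CLIPProcessor that tokenizes text and preprocesses images
        self.image_paths = image_paths
        self.captions = captions
        self.processor = processor

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        encoding = self.processor(
            text=self.captions[idx],
            images=image,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        # Squeeze out the batch dimension added by the processor so the
        # default collator can stack individual examples.
        return {k: v.squeeze(0) for k, v in encoding.items()}
```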

I have checked that the data is shuffled properly and observed that the learning rate decays smoothly. I would appreciate it if you could spare some time to help me.

The plot below is drawn from trainer_state.json.

[training loss plot]
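
For anyone who wants to reproduce this kind of plot, a minimal sketch (the checkpoint path is a placeholder) that reads log_history from trainer_state.json and plots the logged training loss against the step:

```python
import json
import matplotlib.pyplot as plt

# Placeholder path; trainer_state.json is written into each checkpoint directory.
with open("output_dir/checkpoint-1000/trainer_state.json") as f:
    state = json.load(f)

# log_history holds one dict per logging step; training entries contain "loss".
steps = [e["step"] for e in state["log_history"] if "loss" in e]
losses = [e["loss"] for e in state["log_history"] if "loss" in e]

plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.title("Training loss from trainer_state.json")
plt.show()
```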

Expected behavior

Figure out the reason.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 19

Top GitHub Comments

3 reactions
n9Mtq4 commented, Aug 29, 2022

I don't think it's related to CLIP, as I've seen this happen with multiple models. Here's the training loss with OPT-350M; OPT-1.3B and GPT-Neo-125M also showed this behavior, and the larger the model, the larger the loss drops were. Unfortunately, I no longer have the TensorBoard logs for those runs. I've also seen this to a lesser degree with a large MLP and my own PyTorch training loop, so I don't think it's an issue with HF transformers.

[OPT-350M training loss plot]

The model is learning at every step, not just at the start of each epoch. Consider a dataset of three points, a, b and c, where learning from one data point doesn't improve the predictions for the others (e.g. training on a doesn't improve the predictions on b or c). Let's look at what training a model on this dataset would look like.

The data is reshuffled each epoch, but the trainer goes through all of the data before repeating any point. So for 4 epochs, an example training session could look like this:

| Step | Epoch | Data point | Number of times the model has seen the data point | Loss |
|------|-------|------------|----------------------------------------------------|------|
| 1    | 1     | c          | 0                                                  | 5    |
| 2    | 1     | b          | 0                                                  | 5    |
| 3    | 1     | a          | 0                                                  | 5    |
| 4    | 2     | b          | 1                                                  | 4    |
| 5    | 2     | a          | 1                                                  | 4    |
| 6    | 2     | c          | 1                                                  | 4    |
| 7    | 3     | c          | 2                                                  | 3    |
| 8    | 3     | b          | 2                                                  | 3    |
| 9    | 3     | a          | 2                                                  | 3    |
| 10   | 4     | a          | 3                                                  | 2    |
| 11   | 4     | b          | 3                                                  | 2    |
| 12   | 4     | c          | 3                                                  | 2    |

Notice that within any single epoch the loss stays the same, since one data point doesn't improve the predictions for the others. But as the model trains, it memorizes the correct prediction for each data point, so when it sees that point again (which only happens in the next epoch) it produces a better prediction for that specific point. The loss for any given data point is therefore correlated with how many times the model has seen that point, a count that only increases at the start of each epoch.
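
As a sanity check of this explanation, here is a small toy simulation (not based on any real model; the loss rule is purely illustrative) where the loss for a data point depends only on how many times that point has already been trained on. It reproduces the flat-within-epoch, drop-at-epoch-boundary pattern from the table above:

```python
import random

data = ["a", "b", "c"]
seen = {x: 0 for x in data}          # how many times each point has been trained on
step = 0

for epoch in range(1, 5):
    random.shuffle(data)             # reshuffle each epoch; no repeats within an epoch
    for point in data:
        step += 1
        loss = 5 - seen[point]       # loss only improves once the point is revisited
        print(f"step {step}  epoch {epoch}  point {point}  loss {loss}")
        seen[point] += 1             # the "model" memorizes the point after training on it
```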

As for why it's not happening with your BERT model: perhaps the model is too small, you have enough data to prevent memorization, or the dataset simply doesn't have this property.

I'll point out again that this is my best guess as to why this is happening, and I haven't done any experimentation to confirm it. You could try training by sampling your dataset with replacement, so that a single data point can appear multiple times in the same epoch. I would expect the drop in loss at epoch starts to disappear, although the memorization would still occur.
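
For the with-replacement experiment suggested above, here is a minimal sketch using a plain PyTorch DataLoader and RandomSampler (the dummy TensorDataset is just a placeholder; wiring this into the HF Trainer would mean swapping out its train sampler, which isn't shown here):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Dummy dataset just to illustrate the sampler; replace with your own Dataset.
dataset = TensorDataset(torch.arange(10))

# replacement=True means a point can be drawn several times (or not at all)
# within the same "epoch", so the per-epoch memorization boundary disappears.
sampler = RandomSampler(dataset, replacement=True, num_samples=len(dataset))
loader = DataLoader(dataset, sampler=sampler, batch_size=2)

for batch in loader:
    print(batch)
```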

1 reaction
lchwhut commented, Aug 24, 2022

Thanks. The Trainer does reset the loss in the function _maybe_log_save_evaluate. But I still don't understand this phenomenon, because I get a smooth loss curve when I train BERT with the same Trainer and a Dataset. Anyway, I'll figure it out myself, thanks a lot!
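
To make that reset concrete, here is a rough, simplified sketch of the averaging-and-reset logging pattern being described (illustrative only, not the actual transformers source): the running loss is averaged over the steps since the last log and then zeroed, so each logged value is a per-window mean rather than a cumulative average.

```python
# Simplified illustration of the logging pattern; the real transformers Trainer
# performs the averaging and reset inside _maybe_log_save_evaluate.
running_loss = 0.0
last_logged_step = 0
logging_steps = 10

for step in range(1, 101):
    step_loss = 1.0 / step            # stand-in for the real per-step loss
    running_loss += step_loss

    if step % logging_steps == 0:
        avg = running_loss / (step - last_logged_step)
        print({"step": step, "loss": round(avg, 4)})
        running_loss = 0.0            # the reset mentioned in the comment above
        last_logged_step = step
```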


