The training loss (per logging step) drops suddenly after each epoch. Help me, please! Orz
See original GitHub issue
System Info
transformers version: 4.17.0
Python version: 3.7.0
torch version: 1.10.1
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
The official CLIP example script (https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text).
I implemented a custom Dataset for training, but found that the training loss drops suddenly after each epoch. The Dataset overrides the three standard methods (__init__, __getitem__, and __len__), and I couldn't figure out the reason for this behavior.
I verified that the data is shuffled properly and that the learning rate decays smoothly. I would appreciate it if you could spare some time to help me.
The picture is drawn from the values in trainer_state.json.
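For context, a custom Dataset of the kind described above typically looks like the following minimal sketch. The class name `ImageTextDataset` and the `records` field are illustrative assumptions, not taken from the actual issue:

```python
# Hypothetical minimal image-text Dataset overriding the three methods
# mentioned above (__init__, __getitem__, __len__); names are illustrative.
from torch.utils.data import Dataset


class ImageTextDataset(Dataset):
    def __init__(self, records):
        # records: list of (pixel_values, input_ids) pairs, already preprocessed
        self.records = records

    def __len__(self):
        # number of examples; used by the DataLoader to define one epoch
        return len(self.records)

    def __getitem__(self, idx):
        # return one example in the dict format the model's forward() expects
        pixel_values, input_ids = self.records[idx]
        return {"pixel_values": pixel_values, "input_ids": input_ids}
```

Nothing in such a Dataset itself would cause per-epoch loss drops, which is why the discussion below focuses on the training dynamics instead.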
Expected behavior
Figure out the reason.
Issue Analytics
- State:
- Created a year ago
- Comments:19
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I don’t think it’s related to CLIP as I’ve seen this happen with multiple models. Here’s the training loss with OPT-350M. OPT-1.3B and GPT-neo-125M also had this behavior. The larger the model, the larger the loss drops were. Unfortunately, I no longer have the tensorboard logs for those runs. I’ve also seen this to a lesser degree with a large MLP and my own pytorch training loop, so I don’t think it’s an issue with HF transformers.
The model is learning at every step, not just at the start of each epoch. Consider a dataset of three points, a, b, and c, where learning from one data point doesn't improve the predictions for the others (e.g. training on a doesn't improve the predictions on b or c). Let's look at what training a model on this dataset would look like.
For each epoch the data is shuffled, but the loader goes through all the data before repeating any of it. So over 4 epochs, an example training session could look like this:
Notice that during any single epoch, since one data point doesn’t improve the predictions for others, the loss stays the same. But as the model trains, it remembers the correct prediction for each data point, so that when it sees it again (which happens in the next epoch) it will produce a better prediction for that specific data point. So the loss for any given data point is correlated with how many times it has seen that specific data point which increments at the start of each epoch.
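The memorization explanation above can be illustrated with a toy simulation. This is an assumption about the mechanism, not a measurement from the actual run: each example's loss here depends only on how many times the model has already seen that exact example, so the loss is flat within an epoch and steps down at each epoch boundary:

```python
# Toy illustration of the memorization hypothesis: loss for an example
# halves each time that specific example is revisited, and nothing else
# affects it. Values are illustrative, not from any real training run.
import random

examples = ["a", "b", "c"]
seen = {ex: 0 for ex in examples}


def loss(ex):
    # loss depends only on how often this exact example was seen before
    return 1.0 * (0.5 ** seen[ex])


log = []
for epoch in range(4):
    order = examples[:]
    random.shuffle(order)  # shuffled each epoch, no repeats within one epoch
    for ex in order:
        log.append(round(loss(ex), 3))
        seen[ex] += 1

print(log)  # [1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.125, 0.125, 0.125]
```

Regardless of the shuffle order, every loss within an epoch is identical and the drop happens exactly at the epoch boundary, reproducing the staircase shape described above.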
As for why it’s not happening with your BERT model, perhaps the model is too small, you have sufficient data to prevent memorization, or the dataset doesn’t have this property.
I’ll point out again that this is my best guess to why this is happening and I haven’t done any experimentation to confirm that this is the reason. You could try training by sampling your dataset with replacement so that a single data point could appear multiple times in the same epoch. I would expect that the drop in loss at epoch starts wouldn’t be visible, although the memorization would still occur.
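The sampling-with-replacement experiment suggested above could be sketched like this with plain PyTorch. With the HF Trainer you would need to swap in this sampler yourself (how exactly depends on the Trainer version, so this only shows the core DataLoader-level idea):

```python
# Sketch: sample the dataset WITH replacement, so a data point can appear
# multiple times (or not at all) within one "epoch". Under the memorization
# hypothesis, this should smear out the per-epoch loss drops.
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))

# replacement=True draws num_samples indices i.i.d. uniformly, so
# duplicates within a single pass are possible
sampler = RandomSampler(dataset, replacement=True, num_samples=len(dataset))
loader = DataLoader(dataset, sampler=sampler, batch_size=4)

indices = [int(i) for (batch,) in loader for i in batch]
print(indices)  # same length as the dataset, but indices may repeat
```

One pass still covers `len(dataset)` draws, so epoch length and learning-rate schedule are unchanged; only the "see everything exactly once per epoch" property is removed.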
Thanks. The Trainer does reset the loss in the function "_maybe_log_save_evaluate". But I still don't understand this phenomenon, because I get a smooth loss curve when I train BERT with the same Trainer and Dataset setup. Anyway, I'll figure it out myself, thanks a lot!
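For readers puzzled by the loss-reset remark above: the Trainer logs a windowed average, not a running average over all steps. A toy sketch of that bookkeeping (illustrative numbers, not the Trainer's actual code) shows why each logged value reflects only the steps since the previous log:

```python
# Toy sketch of a running loss that is reset at each logging step, in the
# spirit of Trainer's _maybe_log_save_evaluate; numbers are illustrative.
losses = [4.0, 3.0, 2.0, 1.0]  # pretend per-step training losses
logging_steps = 2

running = 0.0
logged = []
for step, l in enumerate(losses, start=1):
    running += l
    if step % logging_steps == 0:
        # average over the window since the last log, then reset
        logged.append(running / logging_steps)
        running = 0.0

print(logged)  # [3.5, 1.5]
```

Because each logged point covers only its own window, the reset alone cannot explain a drop aligned with epoch boundaries; it just means the curve tracks recent steps rather than the whole run.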