Low loss during fine-tuning but generated answers are not correct
Hi, I am fine-tuning on a QA dataset using the Hugging Face UnifiedQA-v2 T5-large checkpoint, and the sample code is like below:
```python
# training
model_inputs = self.tokenizer(
    questions,
    padding=True, truncation=True,
    max_length=self.tokenizer.model_max_length, return_tensors="pt",
).to(device)
with self.tokenizer.as_target_tokenizer():
    labels = self.tokenizer(
        answers,
        padding=True, truncation=True,
        max_length=self.tokenizer.model_max_length, return_tensors="pt",
    ).to(device)
# replace pad token ids in the labels with -100 so they are ignored by the loss
labels["input_ids"][labels["input_ids"] == self.tokenizer.pad_token_id] = -100
model_inputs["labels"] = labels["input_ids"]
outputs = self.model(**model_inputs)
loss = outputs.loss

# generation
model_inputs = self.tokenizer(
    questions,
    padding=True, truncation=True,
    max_length=self.tokenizer.model_max_length, return_tensors="pt",
).to(device)
sampled_outputs = self.model.generate(
    **model_inputs,
    num_beams=4, max_length=50, early_stopping=True,
)
```
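(The optimizer step is omitted above; a standard step around this loss looks roughly like the sketch below, where AdamW and the learning rate are only placeholders, not taken from the snippet.)

```python
# Sketch of the surrounding optimization step (not part of the snippet above;
# the optimizer choice and learning rate are placeholders).
from torch.optim import AdamW

optimizer = AdamW(self.model.parameters(), lr=1e-4)

optimizer.zero_grad()
loss.backward()   # backpropagate the seq2seq cross-entropy loss
optimizer.step()  # update the model parameters
```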
I can get a fairly low loss (0.41) after fine-tuning for around 5 epochs, yet the generated answers are mostly wrong (0.23 accuracy). According to the T5 docs, `generate` handles prepending the pad token as the decoder start token, so I don't prepend it myself. Also, the generated answers do belong to one of the answer choices; they are just not the correct ones.
I am wondering what might be the issue. Thanks!
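For reference, this is roughly how the generated answers are scored (exact match after decoding; `questions`, `answers`, `self.model`, and `self.tokenizer` are the same objects as in the snippet above, and exact match is only an assumption about the metric):

```python
# Minimal evaluation sketch: decode the beam-search outputs and compare them
# to the gold answers with case-insensitive exact match.
import torch

self.model.eval()
with torch.no_grad():
    model_inputs = self.tokenizer(
        questions,
        padding=True, truncation=True,
        max_length=self.tokenizer.model_max_length, return_tensors="pt",
    ).to(device)
    generated = self.model.generate(
        **model_inputs, num_beams=4, max_length=50, early_stopping=True,
    )

predictions = self.tokenizer.batch_decode(generated, skip_special_tokens=True)
exact_match = sum(
    pred.strip().lower() == gold.strip().lower()
    for pred, gold in zip(predictions, answers)
) / len(answers)
print(f"exact match: {exact_match:.2f}")
```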
Top GitHub Comments
Oh I see. Regardless, I think the lesson I learned is that if the performance is not correlated with the loss, we can give UnifiedQA more training epochs/steps. Thank you for the help all the way, @danyaljj!!
Thanks @danyaljj! After a week of attempts I think I have solved this problem. In my case, fine-tuning for more epochs works. Previously I was fine-tuning for either 5 or 10 epochs and got 0.23 accuracy; when fine-tuning for 50 epochs, I get 0.72 accuracy. I wonder, in your paper did you also fine-tune for that many epochs? Thanks!!
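For anyone who hits the same issue, a rough sketch of the setup that worked, i.e. training for many more epochs and picking the checkpoint by dev-set generation accuracy rather than by training loss (`train_one_epoch`, `evaluate_exact_match`, `train_loader`, `dev_questions`, and `dev_answers` are illustrative placeholders, not names from the original code):

```python
# Hypothetical outer loop: train long and select the best checkpoint by
# generation accuracy on a dev set instead of by training loss.
best_acc = 0.0
for epoch in range(50):  # 5-10 epochs were not enough in this case
    train_one_epoch(self.model, train_loader)  # placeholder training step
    acc = evaluate_exact_match(self.model, self.tokenizer, dev_questions, dev_answers)
    print(f"epoch {epoch}: dev exact match = {acc:.2f}")
    if acc > best_acc:
        best_acc = acc
        self.model.save_pretrained("best_checkpoint")
print(f"best dev exact match: {best_acc:.2f}")
```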