Training loss of BART is going to nan in transformers>=4.21.0
System Info
transformers==4.20.1 and transformers>=4.21.0, torch==1.12.1
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Hi, I’m using a huge dataset, so it is hard to show how to reproduce my problem. I’m fine-tuning a pre-trained BART model as a translation model, but the training loss is calculated completely differently depending on the transformers version.
My pseudocode looks like this:
from torch.cuda import amp
from transformers import BartForConditionalGeneration

net = BartForConditionalGeneration.from_pretrained("gogamza/kobart-base-v1").to(rank)
net.train()
with amp.autocast(enabled=True):
    output = net(
        input_ids=input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        labels=labels,  # needed so that output.loss is actually computed
    )
# draw graphs of output.loss
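For completeness, here is a minimal mixed-precision training-step sketch around the call above (the dataloader, optimizer, and batch field names are illustrative assumptions; the original report does not show this part):

```python
from torch.cuda import amp

scaler = amp.GradScaler()  # standard recipe when training under autocast

for batch in dataloader:   # hypothetical dataloader yielding tokenized batches
    optimizer.zero_grad()
    with amp.autocast(enabled=True):
        output = net(
            input_ids=batch["input_ids"].to(rank),
            attention_mask=batch["attention_mask"].to(rank),
            decoder_input_ids=batch["decoder_input_ids"].to(rank),
            decoder_attention_mask=batch["decoder_attention_mask"].to(rank),
            labels=batch["labels"].to(rank),
        )
    # Scale the loss for the backward pass, then step and update the scaler.
    scaler.scale(output.loss).backward()
    scaler.step(optimizer)
    scaler.update()
    wandb.log({"train/loss": output.loss.item()})  # assumes wandb.init() was called
```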
I drew the graph (training loss by iteration) using wandb.
- effortless-water-23 (green): transformers>=4.21.0
- swept-tree-24 (pink): transformers==4.20.1
swept-tree-24 slowly converged toward zero, but effortless-water-23 eventually became NaN after 80k+ iterations. (The graph above doesn’t show that point.)
I’ve looked into the differences between transformers>=4.21.0 and transformers==4.20.1, especially around BART, and I suspect this part. When I reverted that change in transformers>=4.21.0 like this:
# mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min))  # >=4.21.0 constant
mask = torch.full((tgt_len, tgt_len), torch.tensor(float("-inf")))  # reverted to the ==4.20.1 behavior
the problem was gone. (The result is the same as swept-tree-24, which used transformers==4.20.1.)
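To illustrate why the constant might matter, here is a small numeric sketch. It is my own assumption about one possible mechanism, not a confirmed root cause: it shows what happens when the causal mask is combined with a decoder padding mask that (incorrectly) masks the start token at position 0.

```python
import torch

tgt_len = 4
fp32_min = torch.finfo(torch.float32).min  # constant used in >=4.21.0
neg_inf = float("-inf")                    # constant used in ==4.20.1

def causal_mask(fill_value):
    # Row i may attend to positions <= i; future positions get fill_value.
    m = torch.full((tgt_len, tgt_len), fill_value)
    return torch.triu(m, diagonal=1)

# Padding mask that (incorrectly) masks position 0, the decoder start token.
padding_mask = torch.zeros(tgt_len)
padding_mask[0] = fp32_min

row0_new = (causal_mask(fp32_min) + padding_mask)[0]
row0_old = (causal_mask(neg_inf) + padding_mask)[0]

print(torch.softmax(row0_new, dim=-1))  # tensor([0.25, 0.25, 0.25, 0.25])
print(torch.softmax(row0_old, dim=-1))  # tensor([1., 0., 0., 0.])
```

With the finfo(dtype).min constant, the first decoder position ends up attending uniformly to every position, including future and padded ones, which could plausibly destabilize training over many iterations; with float("-inf") it still attends only to itself. Whether this exact interaction explains the divergence at 80k+ iterations, I can't say for sure.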
Anyway, my problem is solved, but I’m wondering what the real cause of the problem is. Thanks in advance.
Expected behavior
I explained this in the Reproduction section above.
Top GitHub Comments
@soocheolnoh I am very happy that you found the cause and a solution! I also really appreciate your effort on the detailed issue description and the further investigation!
It’s better to use the same decoder start token as the one used in pretraining. Using the pad token might work in some cases, but we should be very careful. I believe that in this issue it is related to the decoder attention mask: I saw how you prepared it, and when you used the pad token as the decoder start token, you built a (decoder) attention mask that ignores the decoder start token (which shouldn’t be ignored).
Of course, this is just one observation that might be related.
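For readers hitting the same problem, here is a minimal sketch of one way to prepare decoder inputs so that the start token is never masked. The helper name and the shift-right logic are illustrative, not the poster's actual code; the safest choice remains model.config.decoder_start_token_id.

```python
import torch

def make_decoder_inputs(labels, pad_token_id, decoder_start_token_id):
    # Shift the labels one position to the right and prepend the start token,
    # mirroring how BART builds decoder_input_ids from labels.
    decoder_input_ids = labels.new_full(labels.shape, pad_token_id)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id

    # Build the mask from decoder_input_ids and force position 0 to stay
    # visible, so the start token is not ignored even when the pad token
    # is (re)used as the decoder start token.
    decoder_attention_mask = (decoder_input_ids != pad_token_id).long()
    decoder_attention_mask[:, 0] = 1
    return decoder_input_ids, decoder_attention_mask
```

transformers also ships a shift_tokens_right helper in modeling_bart that does the shifting part; the important bit here is only that the attention mask keeps position 0 visible.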
I am going to close the issue. Don’t hesitate to reopen if you still have further questions.
Thank you! @ydshieh