Different result in AutoModelForCausalLM
🚀 Feature request
Models inside AutoModelForCausalLM differ in how they calculate the loss.
In BartForCausalLM there is no shift in the loss calculation: https://github.com/huggingface/transformers/blob/b013842244df7be96b8cc841491bd1e35e475e36/src/transformers/models/bart/modeling_bart.py#L1745
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
In RobertaForCausalLM a shift is applied before the loss calculation: https://github.com/huggingface/transformers/blob/b013842244df7be96b8cc841491bd1e35e475e36/src/transformers/models/roberta/modeling_roberta.py#L944
# we are doing next-token prediction; shift prediction scores and input ids by one
shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
labels = labels[:, 1:].contiguous()
loss_fct = CrossEntropyLoss()
lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
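To make the difference concrete, here is a small self-contained sketch (toy tensors, no real model, illustrative variable names) that applies the two conventions to the same logits and labels:

```python
import torch
from torch.nn import CrossEntropyLoss

vocab_size = 10
logits = torch.randn(1, 5, vocab_size)         # (batch, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (1, 5))  # labels == input_ids

loss_fct = CrossEntropyLoss()

# BartForCausalLM convention: logits at position t are scored against labels at position t
bart_style = loss_fct(logits.view(-1, vocab_size), labels.view(-1))

# RobertaForCausalLM convention: shift so that logits at position t predict labels at position t+1
shifted_scores = logits[:, :-1, :].contiguous()
shifted_labels = labels[:, 1:].contiguous()
roberta_style = loss_fct(shifted_scores.view(-1, vocab_size), shifted_labels.view(-1))

print(bart_style.item(), roberta_style.item())  # the two values generally differ
```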
Motivation
I noticed the problem when I switched the config from Roberta to BART in AutoModelForCausalLM; the discrepancy turned out to come from the different labeling in the loss. It would be nice to make all CausalLM models handle labels in the same way, either with the shift or without it.
Your contribution
I can open a PR to make sure that all the models apply the shift before the loss calculation.
Top GitHub Comments
`BartForCausalLM` does accept `labels == input_ids`; in general, all the decoders in `EncoderDecoder` accept that, and that’s what we have documented: pass the same input as `labels` and `decoder_input_ids`.
The reason I suggested using `shift_tokens_right` is that BART uses `eos` as `decoder_start_token`, which the `shift_tokens_right` function handles. This is different from `RobertaForCausalLM`, `GPT2LMHeadModel`, ...
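For reference, a paraphrased sketch of what `shift_tokens_right` does in `modeling_bart.py` (treat this as an approximation rather than the library code; the exact signature and checks may differ between versions):

```python
import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int) -> torch.Tensor:
    # Build decoder_input_ids from labels: prepend the decoder start token
    # (eos in BART's config) and drop the last position.
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    # Replace any -100 (ignored label) positions with the pad token so they remain valid ids.
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted
```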
Hmm, I’m not 100% sure whether everybody is on the same page here. `BartForCausalLM` was mostly created to be used in combination with `EncoderDecoderModel` and not as a standalone model. Also, Roberta requires both `input_ids` and `labels` as input to correctly calculate the loss - the difference is just that `input_ids` should be equal to `labels`, with the labels being shifted under the hood. This is not the same thing as the `shift_tokens_right` function, which fully generates the `decoder_input_ids` from the labels…
I think I would be fine with changing the behavior of `BartForCausalLM` so that `labels == input_ids` can be passed to the function, even if this would be a slight breaking change. It would align `BartForCausalLM` more closely with `RobertaForCausalLM`, `GPT2LMHeadModel`, ..., which would then also allow `EncoderDecoderModel` to have a general `shift_tokens` function.
Does this make sense?
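For illustration, a self-contained sketch of the Roberta/GPT-2-style convention that `BartForCausalLM` could adopt internally (the helper name here is made up for this example and is not part of the library):

```python
import torch
from torch.nn import CrossEntropyLoss

def causal_lm_loss_with_shift(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Score logits at position t against labels at position t + 1,
    # so callers can simply pass labels == input_ids.
    shifted_logits = logits[:, :-1, :].contiguous()
    shifted_labels = labels[:, 1:].contiguous()
    loss_fct = CrossEntropyLoss()
    return loss_fct(shifted_logits.view(-1, logits.size(-1)), shifted_labels.view(-1))
```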