Different result in AutoModelForCausalLM
🚀 Feature request
Models loaded through `AutoModelForCausalLM` behave differently in how they calculate the loss.
In `BartForCausalLM`, no shift is applied in the loss calculation: https://github.com/huggingface/transformers/blob/b013842244df7be96b8cc841491bd1e35e475e36/src/transformers/models/bart/modeling_bart.py#L1745
```python
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
```
In `RobertaForCausalLM`, a shift is applied before the loss calculation: https://github.com/huggingface/transformers/blob/b013842244df7be96b8cc841491bd1e35e475e36/src/transformers/models/roberta/modeling_roberta.py#L944
```python
# we are doing next-token prediction; shift prediction scores and input ids by one
shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
labels = labels[:, 1:].contiguous()
loss_fct = CrossEntropyLoss()
lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
```
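To see the practical effect, here is a minimal sketch (plain PyTorch on random tensors, not the library's actual forward pass) that applies the two formulations above to the same logits and labels, i.e. assuming `labels == input_ids`:

```python
import torch
from torch.nn import CrossEntropyLoss

batch_size, seq_len, vocab_size = 2, 8, 50
logits = torch.randn(batch_size, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (batch_size, seq_len))

loss_fct = CrossEntropyLoss()

# BartForCausalLM style: logits at position i are scored against the label at position i.
bart_style_loss = loss_fct(logits.view(-1, vocab_size), labels.view(-1))

# RobertaForCausalLM / GPT2LMHeadModel style: logits at position i are scored
# against the label at position i + 1 (next-token prediction).
shifted_logits = logits[:, :-1, :].contiguous()
shifted_labels = labels[:, 1:].contiguous()
roberta_style_loss = loss_fct(
    shifted_logits.view(-1, vocab_size), shifted_labels.view(-1)
)

# The two values differ, so switching the config from Roberta to BART silently
# changes what the loss means.
print(bart_style_loss.item(), roberta_style_loss.item())
```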
Motivation
I ran into this when I switched the config from Roberta to BART in `AutoModelForCausalLM`: the loss changed because the two models treat the labels differently. It would be nice if all CausalLM models handled labels in the same way, either shifting or not.
Your contribution
I can make a PR to make sure that all the models apply the shift before computing the loss.

Top GitHub Comments
`BartForCausalLM` does accept `labels == input_ids`; in general, all the decoders in `EncoderDecoder` accept that, and that's what we have documented: pass the same input as `labels` and `decoder_input_ids`. The reason I suggested using `shift_tokens_right` is that BART uses `eos` as the `decoder_start_token`, which the `shift_tokens_right` function handles. This is different from `RobertaForCausalLM`, `GPT2LMHeadModel`, ...
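For reference, a small sketch of what `shift_tokens_right` does when building `decoder_input_ids` from the labels (the token ids below are only illustrative values; the import path matches the version linked above):

```python
import torch
from transformers.models.bart.modeling_bart import shift_tokens_right

pad_token_id = 1
decoder_start_token_id = 2  # BART starts decoding from eos

# Illustrative label sequence: <s> ... </s>
labels = torch.tensor([[0, 31414, 232, 2]])

decoder_input_ids = shift_tokens_right(labels, pad_token_id, decoder_start_token_id)
print(decoder_input_ids)  # tensor([[2, 0, 31414, 232]]): eos prepended, sequence shifted right
```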
Hmm, I'm not 100% sure whether everybody is on the same page here. `BartForCausalLM` was mostly created to be used in combination with `EncoderDecoderModel` and not as a standalone model. Also, Roberta requires both `input_ids` and `labels` as an input to correctly calculate the loss - the difference is just that `input_ids` should be equal to `labels`, with the labels being shifted under the hood. This is not the same thing as the `shift_tokens_right` function, which fully generates the `decoder_input_ids` from the labels… I think I would be fine with changing the behavior of `BartForCausalLM` so that `labels == input_ids` can be passed to the function, even if this would be a slight breaking change. It would align `BartForCausalLM` more closely with `RobertaForCausalLM`, `GPT2LMHeadModel`, ..., which would then also allow `EncoderDecoderModel` to have a general `shift_tokens` function. Does this make sense?
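Until such a change lands, a standalone training step for `BartForCausalLM` (at the version linked above, which does not shift internally) has to do the shift itself. A hedged sketch, using a tiny randomly initialised config whose sizes are arbitrary and chosen only to keep the example self-contained:

```python
import torch
from transformers import BartConfig, BartForCausalLM
from transformers.models.bart.modeling_bart import shift_tokens_right

# Tiny config with random weights; the sizes are illustration values only.
config = BartConfig(
    vocab_size=100,
    d_model=16,
    decoder_layers=1,
    decoder_attention_heads=2,
    decoder_ffn_dim=32,
    pad_token_id=1,
    eos_token_id=2,
    decoder_start_token_id=2,
)
model = BartForCausalLM(config)

labels = torch.randint(3, config.vocab_size, (2, 6))

# Because the loss is computed without an internal shift (see the snippet quoted
# above), the input has to be the right-shifted labels for the loss to mean
# next-token prediction.
input_ids = shift_tokens_right(labels, config.pad_token_id, config.decoder_start_token_id)

outputs = model(input_ids=input_ids, labels=labels)
print(outputs.loss)
```

With the proposed change, the last two lines could instead take `labels == input_ids` directly, matching `RobertaForCausalLM` and `GPT2LMHeadModel`.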