Pegasus tokenizer does not have bos token, cannot pretrain
Environment info
- transformers version:
- Platform: Ubuntu 18.04
- Python version: 3.8
- PyTorch version (GPU?): 1.7.1 with GPU (CUDA 10.1)
- Tensorflow version (GPU?): N/A
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
@patrickvonplaten @LysandreJik
Information
Model I am using (Bert, XLNet …): Pegasus
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: SQuAD
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
I am trying to re-create the basic objective of pre-training with Pegasus.
I believe the issue is with the bos token: it does not exist in the tokenizer, as per this PR: https://github.com/huggingface/transformers/pull/8731/files. However, it does exist in the original paper (it’s <s>).
Steps to reproduce:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
model_name = 'google/pegasus-cnn_dailymail'
## The tokenizer must be loaded as well; it is used in every call below.
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)
## Taken from paper:
input_string = ["Pegasus is <mask_2> . <mask_1> it <mask_2> the model ."]
input_ids = tokenizer(input_string, add_special_tokens=False, return_tensors="pt").input_ids
print(input_ids) ## tensor([[51881, 117, 3, 110, 107, 2, 126, 3, 109, 861, 110, 107]])
decoder_input_string = ["<s> It is pure white . "]
decoder_input_ids = tokenizer(decoder_input_string, add_special_tokens=False, return_tensors="pt", bos_token='<s>').input_ids
print(decoder_input_ids) ## tensor([[ 110, 105, 116, 2314, 168, 117, 3763, 695, 110, 107]])
labels_string = ["It is pure white . </s>"]
labels = tokenizer(labels_string, add_special_tokens=False, return_tensors="pt").input_ids
print(labels) ## tensor([[ 168, 117, 3763, 695, 110, 107, 1]])
## This call raises the ValueError shown below.
loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]
In the final line, I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-32-13dda8f18c44> in <module>
----> 1 loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]
/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/transformers/models/pegasus/modeling_pegasus.py in forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1285 if labels is not None:
1286 loss_fct = CrossEntropyLoss()
-> 1287 masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1))
1288
1289 if not return_dict:
/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/loss.py in forward(self, input, target)
959
960 def forward(self, input: Tensor, target: Tensor) -> Tensor:
--> 961 return F.cross_entropy(input, target, weight=self.weight,
962 ignore_index=self.ignore_index, reduction=self.reduction)
963
/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
2466 if size_average is not None or reduce is not None:
2467 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2468 return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
2469
2470
/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
2259
2260 if input.size(0) != target.size(0):
-> 2261 raise ValueError('Expected input batch_size ({}) to match target batch_size ({}).'
2262 .format(input.size(0), target.size(0)))
2263 if dim == 2:
ValueError: Expected input batch_size (10) to match target batch_size (7).
It seems to me that the issue arises while tokenizing decoder_input_ids: the <s> gets tokenized as 4 different indexes (110, 105, 116, 2314) instead of just one. This is because there is no bos_token in the tokenizer.
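The length mismatch can be checked directly from the ids printed in the reproduction. The following is a minimal sketch using plain Python lists instead of tensors, with the values copied from the printed outputs; the helper name `flattened_length` is hypothetical and only mimics what flattening with `.view(-1)` does to the sequence dimension.

```python
# Token ids copied from the printed outputs in the reproduction.
decoder_input_ids = [[110, 105, 116, 2314, 168, 117, 3763, 695, 110, 107]]
labels = [[168, 117, 3763, 695, 110, 107, 1]]


def flattened_length(batch):
    """Count the entries left after flattening a batch of sequences."""
    return sum(len(row) for row in batch)


# The logits have one row per decoder input position, so the loss ends up
# comparing 10 predictions against 7 targets.
n_inputs = flattened_length(decoder_input_ids)
n_labels = flattened_length(labels)

# The labels end with one extra id (1, for </s>), so the surplus at the
# front of the decoder inputs is n_inputs - (n_labels - 1).
extra = n_inputs - (n_labels - 1)
print(n_inputs, n_labels, extra)

# Dropping the surplus leading ids leaves exactly the labels without </s>.
print(decoder_input_ids[0][extra:] == labels[0][:-1])
```

The surplus is the four ids that `<s>` was split into, which matches the 10 versus 7 reported in the ValueError.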
Expected behavior
The ValueError should not be thrown, and decoder_input_ids should have the same length as labels, allowing the model(...) call to work correctly.
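One way to get decoder inputs with the same length as the labels, without relying on a bos token in the tokenizer, is to build them from the labels by shifting right. The sketch below uses plain Python lists; `shift_right` is a hypothetical helper, and the start id of 0 is an assumption (to my understanding Pegasus starts decoding from the pad token id, but this should be checked against `model.config.decoder_start_token_id`).

```python
def shift_right(labels, decoder_start_token_id):
    """Prepend the start id to each row and drop the row's last id."""
    return [[decoder_start_token_id] + row[:-1] for row in labels]


# Labels copied from the printed output in the reproduction.
labels = [[168, 117, 3763, 695, 110, 107, 1]]

# Assumption: the decoder start id equals the pad id (0) for this checkpoint.
DECODER_START_TOKEN_ID = 0

decoder_input_ids = shift_right(labels, DECODER_START_TOKEN_ID)
print(decoder_input_ids)
print(len(decoder_input_ids[0]) == len(labels[0]))
```

With real tensors the same idea applies, and label positions that should be ignored by the loss would need separate handling, which this sketch leaves out.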
Issue Analytics
- State:
- Created 2 years ago
- Comments:7 (2 by maintainers)
Top GitHub Comments
Hi @adivekar-utexas, sorry to only answer now.
Hi, I am trying to use GSG in a Bart environment. If I understand GSG correctly, there must be multiple target sentences, not just one as in the example above.
So, is it right that the decoder input and the labels have the shapes below (if there are 2 masked sentences)?
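As a rough illustration of the shapes in question, here is a sketch under the assumption that the masked sentences are concatenated into a single target sequence (my reading of the Pegasus paper, not something confirmed in this thread). The token ids, the eos id of 1, the start id of 0, and the helper `build_gsg_target` are all hypothetical.

```python
def build_gsg_target(sentence_ids, eos_token_id):
    """Concatenate tokenized gap sentences into one target and append EOS once."""
    target = [tok for sent in sentence_ids for tok in sent]
    return target + [eos_token_id]


# Two masked sentences, already tokenized (hypothetical ids).
gap_sentences = [[11, 12, 13], [21, 22]]

# One target row per example, not one row per masked sentence.
labels = [build_gsg_target(gap_sentences, eos_token_id=1)]

# Decoder inputs are the labels shifted right behind an assumed start id of 0.
decoder_input_ids = [[0] + row[:-1] for row in labels]

print(labels)
print(decoder_input_ids)
```

Under this assumption both arrays have shape (batch_size, target_length), so two masked sentences lengthen the sequence rather than adding a dimension.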