
Pegasus tokenizer does not have bos token, cannot pretrain

See original GitHub issue

Environment info

  • transformers version:
  • Platform: Ubuntu 18.04
  • Python version: 3.8
  • PyTorch version (GPU?): 1.7.1 with GPU (CUDA 10.1)
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@patrickvonplaten @LysandreJik

Information

Model I am using (Bert, XLNet …): Pegasus

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: SQuAD
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

I am trying to re-create the basic objective of pre-training with Pegasus.

I believe the issue is with the bos token: it does not exist in the tokenizer, as per this PR: https://github.com/huggingface/transformers/pull/8731/files. However, it does exist in the original paper (it’s <s>).

Steps to reproduce:

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = 'google/pegasus-cnn_dailymail'
model = PegasusForConditionalGeneration.from_pretrained(model_name)
tokenizer = PegasusTokenizer.from_pretrained(model_name)

## Taken from paper:
input_string = ["Pegasus is <mask_2> . <mask_1> it <mask_2> the model ."]
input_ids = tokenizer(input_string, add_special_tokens=False, return_tensors="pt").input_ids
print(input_ids) ## tensor([[51881,   117,     3,   110,   107,     2,   126,     3,   109,   861, 110,   107]])

decoder_input_string = ["<s> It is pure white . "]
decoder_input_ids = tokenizer(decoder_input_string, add_special_tokens=False, return_tensors="pt", bos_token='<s>').input_ids
print(decoder_input_ids) ## tensor([[ 110,  105,  116, 2314,  168,  117, 3763,  695,  110,  107]])

labels_string = ["It is pure white . </s>"]
labels = tokenizer(labels_string, add_special_tokens=False, return_tensors="pt").input_ids
print(labels) ## tensor([[ 168,  117, 3763,  695,  110,  107,    1]])

loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]

In the final line, I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-32-13dda8f18c44> in <module>
----> 1 loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/transformers/models/pegasus/modeling_pegasus.py in forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1285         if labels is not None:
   1286             loss_fct = CrossEntropyLoss()
-> 1287             masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1))
   1288 
   1289         if not return_dict:

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/loss.py in forward(self, input, target)
    959 
    960     def forward(self, input: Tensor, target: Tensor) -> Tensor:
--> 961         return F.cross_entropy(input, target, weight=self.weight,
    962                                ignore_index=self.ignore_index, reduction=self.reduction)
    963 

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   2466     if size_average is not None or reduce is not None:
   2467         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2468     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   2469 
   2470 

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   2259 
   2260     if input.size(0) != target.size(0):
-> 2261         raise ValueError('Expected input batch_size ({}) to match target batch_size ({}).'
   2262                          .format(input.size(0), target.size(0)))
   2263     if dim == 2:

ValueError: Expected input batch_size (10) to match target batch_size (7).

It seems to me that the issue arises while tokenizing decoder_input_ids: <s> gets tokenized as 4 different indices (110, 105, 116, 2314) instead of just one, because there is no bos_token in the tokenizer.
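As a quick check of that hypothesis, the snippet below (not from the original report, assuming the same google/pegasus-cnn_dailymail checkpoint) shows that the tokenizer defines no bos_token and splits <s> into several subword pieces:

from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-cnn_dailymail")
print(tokenizer.bos_token)  ## None -- no bos token is defined
ids = tokenizer("<s>", add_special_tokens=False).input_ids
print(tokenizer.convert_ids_to_tokens(ids))  ## "<s>" is broken into multiple subword pieces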

Expected behavior

The ValueError should not be thrown, and decoder_input_ids should have the same length as labels, allowing the model(...) call to work correctly.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
patil-suraj commented, Jun 7, 2021

Hi @adivekar-utexas, sorry to only answer now.

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")

src_text = "This is source text"
tgt_text = "This is target text"

inputs = tokenizer(src_text, return_tensors="pt")
inputs["labels"] = tokenizer(tgt_text, return_tensors="pt").input_ids

outputs = model(**inputs)
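Passing only labels works because, when decoder_input_ids are not given, the model builds them from the labels by shifting them one position to the right and prepending config.decoder_start_token_id (for Pegasus the decoder start token is the pad token, id 0). Below is a simplified sketch of that shift, using the inputs["labels"] tensor from the snippet above; it is an approximation of the behaviour, not the library's exact code:

def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    # Prepend the decoder start token, drop the last position,
    # and replace any -100 ignore indices with the pad token.
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted

decoder_input_ids = shift_right(inputs["labels"])  # same length as labels, so logits and labels line up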
0 reactions
KimJaehee0725 commented, Dec 16, 2021

Hi, I am trying to use GSG in a BART setup. If I understand GSG correctly, there should be multiple target sentences, not just one as in the example above.

So, is it right that the decoder input and label strings should look like the ones below (if there are 2 masked sentences)?

  • decoder input string: <pad> It is the first masked sentence.<pad>It is the second masked sentence.
  • decoder label string: It is the first masked sentence.<pad>It is the second masked sentence.</s>