
Pegasus tokenizer does not have bos token, cannot pretrain

See original GitHub issue

Environment info

  • transformers version:
  • Platform: Ubuntu 18.04
  • Python version: 3.8
  • PyTorch version (GPU?): 1.7.1 with GPU (CUDA 10.1)
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@patrickvonplaten @LysandreJik

Information

Model I am using (Bert, XLNet …): Pegasus

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: SQuAD
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

I am trying to re-create the basic objective of pre-training with Pegasus.

I believe the issue is with the bos token: it does not exist in the tokenizer, as per this PR: https://github.com/huggingface/transformers/pull/8731/files. However, it does exist in the original paper (it’s <s>).

Steps to reproduce:

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = 'google/pegasus-cnn_dailymail'
model = PegasusForConditionalGeneration.from_pretrained(model_name)
tokenizer = PegasusTokenizer.from_pretrained(model_name)

## Taken from paper:
input_string = ["Pegasus is <mask_2> . <mask_1> it <mask_2> the model ."]
input_ids = tokenizer(input_string, add_special_tokens=False, return_tensors="pt").input_ids
print(input_ids) ## tensor([[51881,   117,     3,   110,   107,     2,   126,     3,   109,   861, 110,   107]])

decoder_input_string = ["<s> It is pure white . "]
decoder_input_ids = tokenizer(decoder_input_string, add_special_tokens=False, return_tensors="pt", bos_token='<s>').input_ids
print(decoder_input_ids) ## tensor([[ 110,  105,  116, 2314,  168,  117, 3763,  695,  110,  107]])

labels_string = ["It is pure white . </s>"]
labels = tokenizer(labels_string, add_special_tokens=False, return_tensors="pt").input_ids
print(labels) ## tensor([[ 168,  117, 3763,  695,  110,  107,    1]])

loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]

In the final line, I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-32-13dda8f18c44> in <module>
----> 1 loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/transformers/models/pegasus/modeling_pegasus.py in forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1285         if labels is not None:
   1286             loss_fct = CrossEntropyLoss()
-> 1287             masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1))
   1288 
   1289         if not return_dict:

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/modules/loss.py in forward(self, input, target)
    959 
    960     def forward(self, input: Tensor, target: Tensor) -> Tensor:
--> 961         return F.cross_entropy(input, target, weight=self.weight,
    962                                ignore_index=self.ignore_index, reduction=self.reduction)
    963 

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   2466     if size_average is not None or reduce is not None:
   2467         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2468     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   2469 
   2470 

/home/ubuntu/anaconda3/envs/pytorch_new/lib/python3.8/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   2259 
   2260     if input.size(0) != target.size(0):
-> 2261         raise ValueError('Expected input batch_size ({}) to match target batch_size ({}).'
   2262                          .format(input.size(0), target.size(0)))
   2263     if dim == 2:

ValueError: Expected input batch_size (10) to match target batch_size (7).

It seems to me that the issue arises while tokenizing decoder_input_ids: <s> gets tokenized as 4 different indices (110, 105, 116, 2314) instead of just one, because there is no bos_token in the tokenizer.
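As a quick check of that hypothesis, the snippet below (not from the original report, assuming the same google/pegasus-cnn_dailymail checkpoint) shows that the tokenizer defines no bos_token and splits <s> into several subword pieces:

from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-cnn_dailymail")
print(tokenizer.bos_token)  ## None -- no bos token is defined
ids = tokenizer("<s>", add_special_tokens=False).input_ids
print(tokenizer.convert_ids_to_tokens(ids))  ## "<s>" is broken into multiple subword pieces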

Expected behavior

The ValueError should not be thrown, and decoder_input_ids should have the same length as labels, allowing the model(...) call to work correctly.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
patil-suraj commented, Jun 7, 2021

Hi @adivekar-utexas, sorry to only answer now.

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")

src_text = "This is source text"
tgt_text = "This is target text"

inputs = tokenizer(src_text, return_tensors="pt")
inputs["labels"] = tokenizer(tgt_text, return_tensors="pt").input_ids

outputs = model(**inputs)
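Passing only labels works because, when decoder_input_ids are not given, the model builds them from the labels by shifting them one position to the right and prepending config.decoder_start_token_id (for Pegasus the decoder start token is the pad token, id 0). Below is a simplified sketch of that shift, using the inputs["labels"] tensor from the snippet above; it is an approximation of the behaviour, not the library's exact code:

def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    # Prepend the decoder start token, drop the last position,
    # and replace any -100 ignore indices with the pad token.
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    shifted.masked_fill_(shifted == -100, pad_token_id)
    return shifted

decoder_input_ids = shift_right(inputs["labels"])  # same length as labels, so logits and labels line up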
0 reactions
KimJaehee0725 commented, Dec 16, 2021

Hi, I am trying to use GSG in a BART setup. If I understand GSG correctly, there should be multiple target sentences, not just one as in the example above.

So, is it right that the decoder input and label strings should look like the ones below (if there are 2 masked sentences)?

  • decoder input string: <pad> It is the first masked sentence.<pad>It is the second masked sentence.
  • decoder label string: It is the first masked sentence.<pad>It is the second masked sentence.</s>