
GPT2 -- build_inputs_with_special_tokens lacking BOS and EOS tokens.

See original GitHub issue

🐛 Bug

Information

Model I am using (Bert, XLNet …): GPT-2

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Script:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add_special_tokens=True, yet no BOS/EOS token shows up in the output
encoded_dict = tokenizer.encode_plus(
    text="Hello I am Moin", add_special_tokens=True,
    max_length=512, truncation_strategy="longest_first", pad_to_max_length=False,
    return_tensors=None, return_token_type_ids=True, return_attention_mask=True,
    return_overflowing_tokens=False, return_special_tokens_mask=False,
)

print(tokenizer.bos_token_id)     # 50256
print(encoded_dict['input_ids'])  # 50256 does not appear in the ids

You should see that the input_ids do not include the bos_token_id. Shouldn’t encode_plus be doing this?

Expected behavior

The <|endoftext|> token should appear, since I passed add_special_tokens=True.
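
In the meantime, the special tokens can be added by hand before encoding. A minimal sketch of that workaround, relying on the fact that GPT-2 uses <|endoftext|> (id 50256) for both bos_token and eos_token:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
text = "Hello I am Moin"

# Prepend BOS and append EOS manually, then skip automatic special tokens
encoded_dict = tokenizer.encode_plus(
    tokenizer.bos_token + text + tokenizer.eos_token,
    add_special_tokens=False,
    return_attention_mask=True,
)

print(encoded_dict['input_ids'])  # now starts and ends with 50256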

Environment info

  • transformers version:
  • Platform: Linux-4.15.0-54-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.2
  • PyTorch version (GPU?): 1.3.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 6
  • Comments: 14 (6 by maintainers)

Top GitHub Comments

13 reactions
xvr-hlt commented, Dec 9, 2020

Ran into this too – this seems like a bug to me, or at least not intuitive behaviour.

If there’s a tokeniser that has an EOS token, and I encode with add_special_tokens=True, I’d expect it to include the EOS token at the end of the sentence.
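
For comparison, here is a quick sketch against a tokenizer that does wrap its inputs under the same flag (BERT adds [CLS]/[SEP]; the exact ids in the comments are only illustrative):

from transformers import BertTokenizer, GPT2Tokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")

# BERT wraps the sequence in [CLS] ... [SEP] when add_special_tokens=True
print(bert_tok.encode("hello", add_special_tokens=True))  # e.g. [101, 7592, 102]

# GPT-2 returns the sequence unchanged, even though bos/eos tokens are defined
print(gpt2_tok.encode("hello", add_special_tokens=True))  # no 50256 anywhere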

6 reactions
twadada commented, Oct 19, 2021

@patrickvonplaten

Hi, I also believe that BOS should be prepended to an input sentence (w1, w2, …), for two reasons:

  1. Without BOS, the model cannot calculate the probability of generating the first token, i.e. P(w1|BOS) (see the sketch further below).
  2. BOS also affects the probability of generating the following words, e.g. P(w2|w1) != P(w2|w1, BOS).

For the second point, see the following example:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
inputs = tokenizer("<|endoftext|>This", return_tensors="pt")
# inputs: {'input_ids': tensor([[50256,  1212]]), 'attention_mask': tensor([[1, 1]])}
outputs = model(**inputs, labels=inputs["input_ids"])
tokenizer.convert_ids_to_tokens(outputs.logits[0][1].topk(20)[1])
# ['Ġis', 'Ġarticle', 'Ġpost', 'Ġweek', 'Ġpage', 'Ġstory', 'Ġyear', 'Ġwas', 'Ġmonth', 'Ġsite', 'Ġbook', 'Ġpast', 'Ġitem', 'Ġproject', 'Ġblog', 'Ġstudy', 'Ġsection', 'Ġmorning', 'Ġvideo', 'Ġgame']

inputs = tokenizer("This", return_tensors="pt")
# {'input_ids': tensor([[1212]]), 'attention_mask': tensor([[1]])}
outputs = model(**inputs, labels=inputs["input_ids"])
tokenizer.convert_ids_to_tokens(outputs.logits[0][0].topk(20)[1])
# ['Ġis', ',', '.', 'Ċ', "'s", 'Ġwas', 'Ġto', 'Ġand', 'Ġthe', 'Ġin', 'Ġhas', 'Ġof', 'Ġwill', 'Ġa', ':', 'Ġare', 'Ġcan', 'Ġ(', '-', 'Ġfor']

Comparing these two generations, the prediction with “<|endoftext|>” seems more accurate (e.g. without BOS, some punctuation marks are predicted as the next word after “This”).

Due to the lack of documentation, I am not entirely sure if the “<|endoftext|>” token is actually used as a BOS token during training, but the following example suggests it may be the case.

inputs = tokenizer("<|endoftext|>", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
tokenizer.convert_ids_to_tokens(outputs.logits[0][0].topk(20)[1])

# ['Ċ', 'The', '"', 'A', 'I', 'In', '.', 'It', 'S', 'This', 'B', '-', 'C', 'We', '1', 'T', "'", 'P', '(', 'G']
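
For the first point, a minimal sketch of how P(w1|BOS) can only be read off once BOS is prepended (the sentence is just an illustration):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

text = "This is a test"
inputs = tokenizer(tokenizer.bos_token + text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The prediction at position 0 (the BOS token) scores the first real word,
# so log P(w1|BOS) is the log-probability assigned to the token at position 1
log_probs = torch.log_softmax(logits[0, 0], dim=-1)
first_word_id = inputs["input_ids"][0, 1]
print(log_probs[first_word_id].item())

# Without the prepended BOS there is no position whose prediction targets w1,
# so P(w1|BOS) cannot be computed at all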

Even if you opt not to prepend BOS, I believe these points should be clarified in the documentation.
