GPT2 -- build_inputs_with_special_tokens lacking BOS and EOS tokens.
See original GitHub issue.

🐛 Bug
Information
Model I am using (Bert, XLNet …): GPT-2
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
Script:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
encoded_dict = tokenizer.encode_plus(
    text="Hello I am Moin",
    add_special_tokens=True,
    max_length=512,
    truncation_strategy="longest_first",
    pad_to_max_length=False,
    return_tensors=None,
    return_token_type_ids=True,
    return_attention_mask=True,
    return_overflowing_tokens=False,
    return_special_tokens_mask=False,
)
print(tokenizer.bos_token_id)
print(encoded_dict["input_ids"])
You should see that the input_ids do not include the bos_token_id. Shouldn't encode_plus be doing this?
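Until encode_plus does this itself, a minimal workaround (a sketch; it simply adds the ids by hand, and note that GPT-2 uses the same id, 50256, for both BOS and EOS) is:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Encode without special tokens, then add BOS/EOS manually.
ids = tokenizer.encode("Hello I am Moin", add_special_tokens=False)
ids = [tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id]
print(ids)  # first and last ids are 50256 (<|endoftext|>)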
Expected behavior
The <|endoftext|> token should appear in the output, since I passed add_special_tokens=True.
Environment info
- transformers version:
- Platform: Linux-4.15.0-54-generic-x86_64-with-debian-buster-sid
- Python version: 3.7.2
- PyTorch version (GPU?): 1.3.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Ran into this too – this seems like a bug to me, or at the least not intuitive behaviour. If there's a tokeniser that has an EOS token, and I encode with add_special_tokens=True, I'd expect it to include the EOS token at the end of the sentence. @patrickvonplaten
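For fast tokenizers, one way to opt in to this behaviour today is a sketch along the following lines, assuming the huggingface/tokenizers TemplateProcessing post-processor (note this replaces GPT-2's default ByteLevel post-processor):

from transformers import GPT2TokenizerFast
from tokenizers.processors import TemplateProcessing

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Make add_special_tokens=True prepend BOS and append EOS.
tokenizer.backend_tokenizer.post_processor = TemplateProcessing(
    single="<|endoftext|> $A <|endoftext|>",
    special_tokens=[("<|endoftext|>", tokenizer.eos_token_id)],
)

print(tokenizer("Hello I am Moin")["input_ids"])
# input_ids now start and end with 50256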
Hi, I also believe that BOS should be prepended before an input sentence (w1, w2, …) for two reasons:
For the second point, see the following example:
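A sketch of such a comparison (greedy generation from the prompt "This", with and without the "<|endoftext|>" prefix; exact continuations depend on the transformers version):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

for prompt in ("This", "<|endoftext|>This"):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(input_ids, max_length=20, do_sample=False)
    print(repr(tokenizer.decode(output[0])))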
Comparing these two generations, the prediction with "<|endoftext|>" seems more accurate (e.g. without BOS, some punctuation marks are predicted as the next word after "This").
Due to the lack of documentation, I am not entirely sure if the “<|endoftext|>” token is actually used as a BOS token during training, but the following example suggests it may be the case.
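A hedged way to probe this (a sketch, assuming a recent transformers API) is to look at the model's next-token distribution conditioned only on "<|endoftext|>"; if the token marked document boundaries during training, the model should assign high probability to plausible sentence openers:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = torch.tensor([[tokenizer.bos_token_id]])  # just <|endoftext|>
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]
top = logits.softmax(-1).topk(5)
print([(tokenizer.decode([int(i)]), round(float(p), 4))
       for i, p in zip(top.indices, top.values)])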
Even if you opt not to prepend BOS, I believe these things should be made clearer in the documentation.