GPT2 (pre-trained not fine-tuned) only generates additional special tokens
Environment info
- transformers version: 3.5.0
- Platform: Darwin-19.6.0-x86_64-i386-64bit
- Python version: 3.6.3
- PyTorch version (GPU?): 1.7.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
Information
Model I am using (GPT2 / DistilGPT2):
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
I’m using GPT2 or DistilGPT2 on MetalWOZ. The issue I’m having is that when I add special tokens (even bos, eos, etc.) and prompt the model, it generates only those special tokens and no other tokens. For example, if I add the tokens <USER> and <SYSTEM> and prompt the model with:
“I want a pepperoni pizza with mushroom”
I get:
“I want a pepperoni pizza with mushroom <USER> <USER> <USER> <SYSTEM> <USER> <USER> <USER> <SYSTEM> <USER> <USER>”
To reproduce
Steps to reproduce the behavior:
- Add special tokens to a GPT2 model (example below with distilgpt2 but I get the same behavior with gpt2)
- Resize embeddings
- Prompt model
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
tokenizer.add_special_tokens(
    {'additional_special_tokens': ['<USER>', '<SYSTEM>']}
)
model = GPT2LMHeadModel.from_pretrained('distilgpt2')
model.resize_token_embeddings(len(tokenizer))

inp_tok_ids = tokenizer.encode('I want a pepperoni pizza with mushroom')
inp_tensor = torch.LongTensor(inp_tok_ids).unsqueeze(0)

model.eval()
with torch.no_grad():
    for i in range(10):
        outputs = model(inp_tensor)
        logits = outputs[0][:, -1, :]
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
        inp_tensor = torch.cat([inp_tensor, next_token.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(inp_tensor[0]))
Expected behavior
I would expect a mix of the new special tokens and other tokens.
Issue Analytics
- State:
- Created 3 years ago
- Comments:11 (4 by maintainers)
Thanks @LysandreJik and @patrickvonplaten! I like @g-karthik’s suggestion; it would be nice for this behaviour to happen automatically.
@patrickvonplaten yes, I was thinking I’ll try and estimate the mean and covariance of the set of values in GPT-2’s pre-trained embeddings (across all of its 4 model sizes), assuming a Gaussian distribution. And then update the random initialization’s mean and std. dev. accordingly in the model’s _init_weights(). That way, the random initialization comes from a distribution that’s effectively “similar” to that of the pre-trained vectors, and hence decoding sequences would result in a mixture of the original tokens and added tokens.
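For illustration, here is a minimal sketch of that idea: fit a Gaussian to the existing embedding values, then draw the new rows from it. It uses plain Python lists as a toy stand-in for the embedding matrix rather than the real GPT-2 weights. The helper names (fit_gaussian, sample_new_rows), the matrix size, and the 0.14 spread are hypothetical and chosen only for the example. This is not how transformers' _init_weights() works.

```python
import random
import statistics


def fit_gaussian(embedding_rows):
    """Return (mean, std) computed over every scalar value in the rows."""
    values = [v for row in embedding_rows for v in row]
    return statistics.mean(values), statistics.pstdev(values)


def sample_new_rows(embedding_rows, num_new, seed=0):
    """Draw num_new rows from a Gaussian fitted to the existing rows."""
    mean, std = fit_gaussian(embedding_rows)
    rng = random.Random(seed)
    dim = len(embedding_rows[0])
    return [[rng.gauss(mean, std) for _ in range(dim)] for _ in range(num_new)]


# Toy stand-in for a pre-trained embedding matrix (hypothetical values,
# NOT the real GPT-2 weights).
_toy_rng = random.Random(42)
pretrained = [[_toy_rng.gauss(0.0, 0.14) for _ in range(32)] for _ in range(200)]

# Two new rows, e.g. one each for <USER> and <SYSTEM>.
new_rows = sample_new_rows(pretrained, num_new=2)

old_mean, old_std = fit_gaussian(pretrained)
new_mean, new_std = fit_gaussian(new_rows)
print(f"pre-trained: mean={old_mean:.3f} std={old_std:.3f}")
print(f"new rows:    mean={new_mean:.3f} std={new_std:.3f}")
```

With a real model, the same statistics would presumably be computed from the input embedding weights before resizing, and the sampled rows copied into the newly added positions afterwards. That part is left out here because it depends on the model's internals.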