Attention masks are ignored when using model.generate() in batch setting for GPT-2
Environment info
- transformers version: 3.3.1 and 2.1.0 (tested on both)
- Platform: Linux Azure VM
- Python version: 3.6.8
- PyTorch version (GPU?): 1.3.0 (Yes)
- Tensorflow version (GPU?): N/A
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
Information
Model I am using (Bert, XLNet …): GPT-2
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
import argparse
import logging
import os
import sys
import time
sys.path.append('transformers/src')
import numpy as np
import torch
import csv
import copy
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
)
from multiprocessing import Pool, cpu_count
from tqdm import tqdm

MODEL_CLASSES = {
    "gpt2": (GPT2LMHeadModel, GPT2Tokenizer),
}
def set_seed():
    np.random.seed(42)
    torch.manual_seed(42)
    torch.cuda.manual_seed_all(42)
def generate_sequences_parallel(model, tokenizer, orig_prompt_list):
    set_seed()
    proc_cnt = cpu_count() - 2
    prompt_list = copy.deepcopy(orig_prompt_list)
    max_seq_len = 128
    requires_preprocessing = False
    if not requires_preprocessing:
        # GPT-2 doesn't require preprocessing, so we don't need to parallelize that
        inputs = tokenizer(orig_prompt_list, add_special_tokens=False, return_tensors="pt", padding=True)
        input_ids = inputs["input_ids"]
        attn_masks = inputs["attention_mask"]
        max_len_input_ids = max([len(input_id) for input_id in input_ids])
        input_ids = input_ids.to('cuda')
        attn_masks = attn_masks.to('cuda')
        output_sequences = model.generate(
            input_ids=input_ids,
            max_length=10 + max_len_input_ids,
            temperature=1.0,
            top_k=0,
            top_p=0.9,
            repetition_penalty=1.0,
            do_sample=True,
            num_return_sequences=1,
            attention_mask=attn_masks,
        )
    return output_sequences
prompt_list_single = [['Good Morning Who is up with the sun Starting my morning routine with some Yoga and my mood was'], ['What do you all do to make it a great day and my mood was']]
prompt_list_batch = ['Good Morning Who is up with the sun Starting my morning routine with some Yoga and my mood was', 'What do you all do to make it a great day and my mood was']
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.to('cuda')
tokenizer.padding_side = "left"
# Define PAD Token = EOS Token = 50256
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id
single = []
for elem in prompt_list_single:
    single.append(generate_sequences_parallel(model, tokenizer, elem))

print('BATCH')
print()
batch = generate_sequences_parallel(model, tokenizer, prompt_list_batch)

# Compare each single-prompt generation against the corresponding row of the batch output.
assert torch.equal(single[0][0], batch[0])
assert torch.equal(single[1][0], batch[1])
Expected behavior
I expect the results of this script with batch size 1 to be the same as with batch size 2, but generation just ignores all of the generated attention_masks and position_ids. I've looked at #3021 and #3167, but those don't seem to offer a concrete solution. Is there some way to use GPT-2's batch generation?
Thanks!
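[Editor's note] One commonly suggested workaround for left-padded GPT-2 batches is to derive position_ids from the attention mask. Below is a minimal sketch of that idea; it is not taken from this issue, and it assumes the gpt2 checkpoint and a transformers version where the model call returns an output with a .logits attribute. It checks that the last-token logits of the padded row in a batch match those of an unpadded single pass.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompts = ["Good Morning Who is up with the sun", "What do you all do"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
attention_mask = batch["attention_mask"]

# Derive position_ids from the attention mask so that the padded (shorter) row
# gets the same token positions it would have in an unpadded single pass.
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)  # value at padded positions is irrelevant

with torch.no_grad():
    batched_last = model(batch["input_ids"], attention_mask=attention_mask,
                         position_ids=position_ids).logits[:, -1, :]
    single = tokenizer(prompts[1], return_tensors="pt")
    single_last = model(single["input_ids"]).logits[:, -1, :]

# The last-token logits of the padded row should match the unpadded pass
# up to floating-point noise.
print(torch.allclose(batched_last[1], single_last[0], atol=1e-4))

If the logits line up, any remaining differences in sampled text come from the sampling step itself, which is what the discussion below turns to.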
On further investigation, I found that if do_sample is set to False, the batch generation works as expected, but it fails with sampling. For my project, I'm trying to get diverse sentences from GPT-2 using the same prompt, so sampling is very important. Is there a fix on the way for when do_sample = True?
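[Editor's note] A minimal sketch of the deterministic case described above, not from the thread: with do_sample=False, a left-padded batch can be compared token for token against single-prompt runs. It assumes the gpt2 checkpoint and a transformers version whose generate() derives position_ids from the attention mask, as recent releases do.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.config.pad_token_id = model.config.eos_token_id

prompts = ["Good Morning Who is up with the sun", "What do you all do"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
batch_out = model.generate(batch["input_ids"], attention_mask=batch["attention_mask"],
                           max_length=batch["input_ids"].shape[1] + 10, do_sample=False)

for i, prompt in enumerate(prompts):
    single = tokenizer(prompt, return_tensors="pt")
    single_out = model.generate(single["input_ids"], attention_mask=single["attention_mask"],
                                max_length=single["input_ids"].shape[1] + 10, do_sample=False)
    # Strip the left padding from the batched row before comparing token ids.
    pad_len = batch["input_ids"].shape[1] - single["input_ids"].shape[1]
    print(i, torch.equal(batch_out[i, pad_len:], single_out[0]))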
Hey @rohit497,
I checked your sample and the code seems to work fine! When I reproduce your results here, the outputs look good, so I think the attention_mask is correctly applied and batch generation works.

The reason the results are not identical is because you sample from two different distributions. When you pass a single example, the softmax output has batch_size=1, while when you use a batch, the softmax output has a batch_size=2 dimension. That means that the first time you sample from a (1, vocab_size) distribution, whereas the second time you sample from a (2, vocab_size) distribution. Now, while each part of (2, vocab_size) is the same as for the single-batch passes, the sampled output can differ because torch.multinomial does not yield the same results IMO (maybe you can check that actually). I adapted the test slightly; for some reason it had a torch.manual_seed(), which might be misleading. The test only checks for argmax, as this is deterministic.

Hope this helps.