gpt2 results with past_key_values not the same as when computed from scratch
System Info
- transformers version: 4.20.1
- Platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.31
- Python version: 3.9.12
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0+cu113 (False)
- Tensorflow version (GPU?): 2.9.1 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
@patil-suraj @patrickvonplaten @LysandreJik
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Below is a minimal example that reproduces unexpected behavior I encountered while tinkering with past_key_values. Essentially, when I cache the keys and values from a padded batch and then use past_key_values to run a forward pass on one additional token per example, I get somewhat different results than if I compute the full inputs from scratch and look at the logits of the last tokens.
Something seems to go wrong when past_key_values involves padding; however, I believe I am using attention_mask correctly, since I include the masking strategy that was used for past_key_values, as specified in the docs.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token; reuse EOS

# Step 1: run a padded batch and cache its keys/values.
s = ["a b c", "l m n o"]
inputs1 = tokenizer(s, return_tensors='pt', padding=True)
outputs1 = model(**inputs1)

# Step 2: feed one new token per example, passing the cache plus the
# concatenated attention mask (mask used for the cache + mask for new tokens).
s = [" d", " p"]
inputs2 = tokenizer(s, return_tensors='pt', padding=True)
attention_mask = torch.cat((inputs1['attention_mask'], inputs2['attention_mask']), dim=1)
outputs2 = model(input_ids=inputs2['input_ids'], attention_mask=attention_mask,
                 past_key_values=outputs1.past_key_values)

# Step 3: compute the full sequences from scratch for comparison.
s = ["a b c d", "l m n o p"]
inputs_full = tokenizer(s, return_tensors='pt', padding=True)
outputs_full = model(**inputs_full)

# Are the last-token logits of the second example the same? -> passes
assert torch.allclose(outputs2.logits[1, 0], outputs_full.logits[1, -1])
# Are the last-token logits of the first example the same? -> fails
assert torch.allclose(outputs2.logits[0, 0], outputs_full.logits[0, -2])
Expected behavior
The expected behavior would be for the logits of a given token to be the same regardless of whether past_key_values is used for the preceding tokens or the full input is computed from scratch.
Thanks so much for all your hard work on this great library!
Top GitHub Comments
Yep, I will have a look asap
On further inspection, I believe the source of the difference is the position_ids. When the batched and padded past_key_values are used, the default position_ids are computed by the position-id logic in the model's forward pass.
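Roughly, that logic looks like this (a paraphrase of modeling_gpt2.py around v4.20, with simplified variable names; a sketch, not the verbatim source):

# When no position_ids are passed, GPT2Model.forward builds them from the
# cache length, counting every cached slot, padding included:
past_length = past_key_values[0][0].size(-2)
position_ids = torch.arange(past_length, past_length + input_ids.size(-1),
                            dtype=torch.long, device=input_ids.device)
position_ids = position_ids.unsqueeze(0).view(-1, input_ids.size(-1))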
Because past_length includes the padded positions of past_key_values, the default position_ids for the new tokens come out different than if everything is computed from scratch.
I tested this: if you modify my minimal example in the original post with
position_ids = torch.tensor([[3], [4]], dtype=torch.int64)
and pass that to the model's forward call, both asserts pass. So just manually specifying the position_ids solves this problem.
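A more general workaround than hard-coding the positions is to derive position_ids from the attention mask, which is essentially what generate() does via prepare_inputs_for_generation. The sketch below is my own suggestion rather than part of the original issue, and it reuses model, inputs2, attention_mask, and outputs1 from the reproduction script above:

# Each token's position = number of non-pad tokens that precede it.
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)  # value at pad slots is arbitrary
# Keep only the positions of the newly fed tokens (here, the final column).
position_ids = position_ids[:, -inputs2['input_ids'].shape[1]:]  # -> tensor([[3], [4]])

outputs2 = model(input_ids=inputs2['input_ids'], attention_mask=attention_mask,
                 past_key_values=outputs1.past_key_values, position_ids=position_ids)

With these position_ids, both asserts in the reproduction script pass, since the padded cache slots no longer shift the positions of the new tokens.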