
gpt2 results with past_key_values not the same as when computed from scratch


System Info

  • transformers version: 4.20.1
  • Platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.31
  • Python version: 3.9.12
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.0+cu113 (False)
  • Tensorflow version (GPU?): 2.9.1 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@patil-suraj @patrickvonplaten @LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Below is a minimal example that reproduces unexpected behavior I ran into while tinkering with past_key_values. Essentially, when I cache the keys and values from a padded batch and then use past_key_values to run a forward pass on one additional token per example, I get somewhat different results than when I compute the full inputs from scratch and look at the last tokens' logits.

Something seems to go wrong when past_key_values involves padding. However, I believe I am using attention_mask correctly: as the docs specify, it includes the masking strategy that was used for past_key_values.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# Forward pass on a padded batch; cache the keys/values.
s = ["a b c", "l m n o"]
inputs1 = tokenizer(s, return_tensors='pt', padding=True)
outputs1 = model(**inputs1)

# Run one additional token per example, reusing past_key_values.
# The attention mask covers the cached tokens plus the new ones, as the docs specify.
s = [" d", " p"]
inputs2 = tokenizer(s, return_tensors='pt', padding=True)
attention_mask = torch.cat((inputs1['attention_mask'], inputs2['attention_mask']), dim=1)
outputs2 = model(input_ids=inputs2['input_ids'], attention_mask=attention_mask, past_key_values=outputs1.past_key_values)

# Compute the full sequences from scratch for comparison.
s = ["a b c d", "l m n o p"]
inputs_full = tokenizer(s, return_tensors='pt', padding=True)
outputs_full = model(**inputs_full)

# Second (unpadded) example: are the last-token logits the same? -> passes
assert torch.allclose(outputs2.logits[1, 0], outputs_full.logits[1, -1])
# First (padded) example: are the last-token logits the same? -> fails
assert torch.allclose(outputs2.logits[0, 0], outputs_full.logits[0, -2])

Expected behavior

The expected behavior would be for the logits of given tokens to be the same regardless of whether past_key_values is used for preceding tokens or if the full inputs are computed from scratch.

Thanks so much for all your hard work on this great library!

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 2
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
ArthurZucker commented, Sep 27, 2022

Yep, I will have a look asap

1 reaction
jagnusson commented, Jul 12, 2022

On further inspection, I believe the source of the difference is the position_ids. When the batched and padded past_key_values are used, the default position_ids are computed by this code:

if past_key_values is None:
    past_length = 0
    past_key_values = tuple([None] * len(self.h))
else:
    past_length = past_key_values[0][0].size(-2)
if position_ids is None:
    position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
    position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])

Because the past_length includes the padded parts of past_key_values, this will cause the position_ids for the new tokens to be different from what they would be if everything were computed from scratch.
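
To make this concrete with the reproduction above (my own illustration, not from the thread): past_key_values[0][0] has sequence length 4 because of the pad token in the first example, so the default computation assigns position 4 to every new token, even though the first example's new token should sit at position 3:

past_length = outputs1.past_key_values[0][0].size(-2)  # 4: includes the first example's pad slot
# input_shape[-1] is 1 here (one new token per example), so the default is
default_position_ids = torch.arange(past_length, past_length + 1).unsqueeze(0).view(-1, 1)  # tensor([[4]])
# This single row is broadcast over the batch, so the first example's new token
# is embedded at position 4 instead of 3.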

I tested this: if you modify the minimal example in the original post to pass position_ids = torch.tensor([[3], [4]], dtype=torch.int64) to the second forward pass (the one that uses past_key_values), both asserts pass. So manually specifying the position_ids works around the problem.
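
As a more general workaround (a sketch of my own, not from the thread, assuming each example receives exactly one new, non-padded token per step), the new token's position can be derived from the attention mask instead of being hard-coded, similar in spirit to how the generation utilities derive position_ids from the attention mask:

# Sketch: the new token's position is the count of real (non-pad) tokens cached so far.
next_position_ids = inputs1['attention_mask'].sum(dim=-1, keepdim=True)  # tensor([[3], [4]])

outputs2 = model(
    input_ids=inputs2['input_ids'],
    attention_mask=attention_mask,  # concatenated mask from the reproduction above
    position_ids=next_position_ids,
    past_key_values=outputs1.past_key_values,
)

This yields exactly the [[3], [4]] values above, so both asserts in the original example pass.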

