
gpt2 results with past_key_values not the same as when computed from scratch


System Info

  • transformers version: 4.20.1
  • Platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.31
  • Python version: 3.9.12
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.0+cu113 (False)
  • Tensorflow version (GPU?): 2.9.1 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@patil-suraj @patrickvonplaten @LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Below is a minimal example that reproduces unexpected behavior I ran into while tinkering with past_key_values. Essentially, when I cache the keys and values from a padded batch and then use past_key_values to run a forward pass on one additional token per example, I get somewhat different results than when I compute the full inputs from scratch and look at the last tokens' logits.

Something seems to go wrong when past_key_values involves padding. However, I believe I am using attention_mask correctly: as the docs specify, it includes the masking strategy that was used for past_key_values.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# Forward pass on a padded batch; cache the keys/values.
s = ["a b c", "l m n o"]
inputs1 = tokenizer(s, return_tensors='pt', padding=True)
outputs1 = model(**inputs1)

# Run one additional token per example, reusing past_key_values.
# The attention mask covers the cached tokens plus the new ones, as the docs specify.
s = [" d", " p"]
inputs2 = tokenizer(s, return_tensors='pt', padding=True)
attention_mask = torch.cat((inputs1['attention_mask'], inputs2['attention_mask']), dim=1)
outputs2 = model(input_ids=inputs2['input_ids'], attention_mask=attention_mask, past_key_values=outputs1.past_key_values)

# Compute the full sequences from scratch for comparison.
s = ["a b c d", "l m n o p"]
inputs_full = tokenizer(s, return_tensors='pt', padding=True)
outputs_full = model(**inputs_full)

# Second (unpadded) example: are the last-token logits the same? -> passes
assert torch.allclose(outputs2.logits[1, 0], outputs_full.logits[1, -1])
# First (padded) example: are the last-token logits the same? -> fails
assert torch.allclose(outputs2.logits[0, 0], outputs_full.logits[0, -2])

Expected behavior

The expected behavior would be for the logits of given tokens to be the same regardless of whether past_key_values is used for preceding tokens or if the full inputs are computed from scratch.

Thanks so much for all your hard work on this great library!

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 2
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
ArthurZucker commented, Sep 27, 2022

Yep, I will have a look asap

1 reaction
jagnusson commented, Jul 12, 2022

On further inspection, I believe the source of the difference is the position_ids. When the batched and padded past_key_values are used, the default position_ids are computed by this code:

if past_key_values is None:
    past_length = 0
    past_key_values = tuple([None] * len(self.h))
else:
    past_length = past_key_values[0][0].size(-2)
if position_ids is None:
    position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
    position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])

Because the past_length includes the padded parts of past_key_values, this will cause the position_ids for the new tokens to be different from what they would be if everything were computed from scratch.
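
To make this concrete with the reproduction above (my own illustration, not from the thread): past_key_values[0][0] has sequence length 4 because of the pad token in the first example, so the default computation assigns position 4 to every new token, even though the first example's new token should sit at position 3:

past_length = outputs1.past_key_values[0][0].size(-2)  # 4: includes the first example's pad slot
# input_shape[-1] is 1 here (one new token per example), so the default is
default_position_ids = torch.arange(past_length, past_length + 1).unsqueeze(0).view(-1, 1)  # tensor([[4]])
# This single row is broadcast over the batch, so the first example's new token
# is embedded at position 4 instead of 3.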

I tested this: if you modify the minimal example in the original post to pass position_ids = torch.tensor([[3], [4]], dtype=torch.int64) to the second forward pass (the one that uses past_key_values), both asserts pass. So manually specifying the position_ids works around the problem.
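
As a more general workaround (a sketch of my own, not from the thread, assuming each example receives exactly one new, non-padded token per step), the new token's position can be derived from the attention mask instead of being hard-coded, similar in spirit to how the generation utilities derive position_ids from the attention mask:

# Sketch: the new token's position is the count of real (non-pad) tokens cached so far.
next_position_ids = inputs1['attention_mask'].sum(dim=-1, keepdim=True)  # tensor([[3], [4]])

outputs2 = model(
    input_ids=inputs2['input_ids'],
    attention_mask=attention_mask,  # concatenated mask from the reproduction above
    position_ids=next_position_ids,
    past_key_values=outputs1.past_key_values,
)

This yields exactly the [[3], [4]] values above, so both asserts in the original example pass.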

