
Whisper doesn't compute positional embeddings properly when given batches of prompt tokens


System Info

transformers v4.25.1 on an M1 Mac with Python 3.8

Who can help?

@sanchit-gandhi @patrickvonplaten @anton-l

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

When we want to run Whisper generation for a batch of samples with different prompt lengths (prefix tokens given to the decoder), positional embeddings for the decoder are improperly computed. It assumes all sequences have the same past_key_values_length, but this is not true in general.

Scenario: decoder_input_ids = [50361, 45431, 2584, 28682, 13, 50258, 50257, 50257] ("<|startofprev|>Something completely irrelevant.<|startoftranscript|><|pad|><|pad|>")

model.generate(input_features, decoder_input_ids=decoder_input_ids, decoder_attention_mask=decoder_attention_mask) will not give the correct output, because at the start of decoding the pad tokens are not taken into account, so the positional embeddings will be off.
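
Here is a minimal reproduction sketch of that call. The checkpoint name, the second row's token ids, and the random input_features are illustrative placeholders rather than values from the original report; the first row reuses the tokens from the scenario above.

```python
# Minimal reproduction sketch; checkpoint name, the second row's token ids,
# and the random features are illustrative placeholders.
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Stand-in for a real batch of log-mel spectrograms (2 samples, 80 mel bins, 3000 frames).
input_features = torch.randn(2, 80, 3000)

# Two decoder prompts of different lengths; the shorter one is padded with <|pad|> (50257).
decoder_input_ids = torch.tensor([
    [50361, 45431, 2584, 28682, 13, 50258, 50257, 50257],  # prompt from the scenario above, padded
    [50361, 45431, 2584, 28682, 13, 11, 400, 50258],       # illustrative longer prompt
])
decoder_attention_mask = torch.tensor([
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
])

out = model.generate(
    input_features,
    decoder_input_ids=decoder_input_ids,
    decoder_attention_mask=decoder_attention_mask,
)
# The padded row ends up with wrong positional embeddings, because the decoder
# offsets every row by the same past_key_values_length and ignores the mask.
```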

Expected behavior

Instead of tracking past_key_values_length, the decoder should use the attention mask to compute position ids. The current implementation is modeled on encoder-decoder architectures that never prompt the decoder; it should take more inspiration from decoder-only models, which do handle prompting. This is already done for the Flax implementation in #20479.
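
As a sketch of what that could look like (this is the standard decoder-only recipe, not the actual patch in #20479):

```python
import torch

def position_ids_from_attention_mask(attention_mask, past_key_values_length=0):
    """Derive decoder position ids from a (batch, seq_len) attention mask,
    the way decoder-only models such as GPT-2 do."""
    # Each real token gets its index among the real tokens of its own row;
    # padding positions are clamped to 0 so they still index a valid embedding.
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids = position_ids.masked_fill(attention_mask == 0, 0)
    # During incremental decoding, keep only the positions of the new tokens.
    return position_ids[:, past_key_values_length:]

mask = torch.tensor([[1, 1, 1, 1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1, 1, 1, 1]])
print(position_ids_from_attention_mask(mask))
# tensor([[0, 1, 2, 3, 4, 5, 0, 0],
#         [0, 1, 2, 3, 4, 5, 6, 7]])
```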

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

sgugger commented, Dec 6, 2022 (1 reaction)

andyehrenberg commented, Dec 9, 2022 (0 reactions)

@hannan72’s issue is separate from what I’m describing. But yes, padding should always be to max_length. The issue I’m describing arises when pad tokens are added to the shorter sequences in a batch: it won’t raise any errors, it’s just that Whisper’s handling of multiple sequence lengths under the hood is flawed, and it would be fixed by computing position_ids from the attention_mask.
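
To make the mismatch concrete, here is a sketch of the behaviour described above (not the library's actual code): with a shared past_key_values_length, both rows get the same offset for the first generated token, even though the padded row has fewer real tokens in front of it.

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 0, 0],   # padded prompt
                               [1, 1, 1, 1, 1, 1, 1, 1]])  # full-length prompt
past_key_values_length = attention_mask.shape[1]  # 8 after the prompt pass

# Current behaviour: one shared offset for every row in the batch.
shared_position = torch.full((2, 1), past_key_values_length)   # [[8], [8]]

# Mask-aware behaviour: each row continues from its own count of real tokens.
per_row_position = attention_mask.sum(-1, keepdim=True)        # [[6], [8]]

print(shared_position.squeeze(-1), per_row_position.squeeze(-1))
# tensor([8, 8]) tensor([6, 8])
```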
