T5 special tokens not mapped to unique indices in vocabulary
See original GitHub issue

The docs recommend adding the special eos_token </s> to the end of each string when encoding/decoding with T5Tokenizer. However, this and the other special tokens (e.g. unk_token, pad_token) aren't assigned unique ids in the lookup vocabulary: they are mapped to {0, 1, 2}, which are indices for other common words in the vocab. In practice, I find that my model fails to properly produce the eos_token, since it is associated with blank spaces, so the model produces run-ons during generation.
To reproduce
>>> from transformers import T5Tokenizer
>>> tokenizer = T5Tokenizer.from_pretrained('t5-base')
>>> tokenizer.pad_token
'<pad>'
>>> tokenizer.pad_token_id
0
>>> tokenizer.eos_token
'</s>'
>>> tokenizer.eos_token_id
1
>>> tokenizer.unk_token
'<unk>'
>>> tokenizer.unk_token_id
2
>>> tokenizer.decode([0])
''
>>> tokenizer.decode([1])
''
>>> tokenizer.decode([2])
' ⁇ '
Expected behavior
>>> tokenizer.decode([0])
'<pad>'
>>> tokenizer.decode([1])
'</s>'
>>> tokenizer.decode([2])
'<unk>'
Environment info
transformers version: 2.9.1
Top GitHub Comments
cc @danyaljj I’m just going to consolidate discussion from (#7796) here. (Also relevant is the HF forum thread.)

max_src_len above is the maximum length of any input sequence, counted in…wait for it…number of characters. Whoops. That was dumb. I intended to go through and find the maximum sequence length in tokens. I’ll fix that, but I don’t think it affects other things: it turns out that max_src_len, max_tgt_len = (250, 250) for the inputs I was using. But that just means we had a lot of padding. I was using finetune.py just last month, so I don’t think it was the EOS token.
The “gibberish” generation still occurs if I just use finetune_t5.sh as written. If I do either of the following, the outputs are correct:
- call use_task_specific_params(self.model, "summarization") in finetune.py
- pass the equivalent task-specific parameters directly in the generate call

This is because config.json for t5-small and t5-base has the following task_specific_params (@danyaljj this is also the answer to our question about where prefix is getting picked up in the HF forum).
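To see where those defaults come from, here is a minimal sketch of inspecting them (assuming a transformers version where T5Config.from_pretrained and task_specific_params behave as on recent releases):

from transformers import T5Config

# Print the summarization defaults shipped in t5-small's config.json;
# the use_task_specific_params helper mentioned above copies these onto the model config.
config = T5Config.from_pretrained("t5-small")
print(config.task_specific_params["summarization"])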
But it looks like the only param that really mattered was min_length. Beam size, max_length, prefix, etc. weren’t causing the problem. I verified this on both the (sent) => (sent) copy and the (sent) => (first word of sent) tasks.
So at least for my use case it seems like the tokenizer decode bug was not causing a problem? It seems like even though we, as users, couldn’t decode the tokens correctly, the model still knew that 1 == EOS and that after an EOS it should print PAD. The problem was that we were forcing it to generate at least 30 tokens (see the sketch after this comment), hence all the gibberish that I was seeing.
@sshleifer, does this make sense with your understanding of the finetune script? i.e., that failing to decode EOS shouldn’t matter?
@danyaljj, given that you wanted relatively short outputs of the answers to questions, this seems like it might fix the issue for you? Give it a try and see what happens?
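What the fix amounts to, as a minimal sketch of overriding those defaults at generation time (the parameter values here are illustrative, not the exact ones from finetune.py):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer.encode("The house is wonderful.", return_tensors="pt")
output_ids = model.generate(
    input_ids,
    max_length=50,
    min_length=0,          # the summarization defaults force min_length=30, which caused the run-ons
    num_beams=4,
    early_stopping=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))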
For anyone looking for a quick, temporary fix to the unending-generation problem: override the EOS token with a custom one (note this fix does not work for unk_token or pad_token; for some reason they can’t be re-mapped).
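A minimal sketch of that workaround; '<custom_eos>' is an illustrative token name, and the resize step assumes you are also fine-tuning the model:

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Register a brand-new EOS token; it gets a fresh id at the end of the vocab.
tokenizer.add_special_tokens({"eos_token": "<custom_eos>"})
model.resize_token_embeddings(len(tokenizer))

# The new token survives decoding, unlike the original id 1.
print(tokenizer.eos_token_id, tokenizer.decode([tokenizer.eos_token_id]))
# Training targets need to end with '<custom_eos>' so the model learns to emit it.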