T5 special tokens not mapped to unique indices in vocabulary

See original GitHub issue

The docs recommend adding the special eos_token </s> to the end of each string when encoding/decoding with T5Tokenizer. However, this token (and the other special tokens, e.g. unk_token and pad_token) isn't assigned a unique id in the lookup vocabulary: they map to {0, 1, 2}, which are indices for other common words in the vocab. In practice, I find my model fails to properly produce the eos_token, since it is associated with blank spaces, so the model produces run-ons during generation.

To reproduce

>>> from transformers import T5Tokenizer
>>> tokenizer = T5Tokenizer.from_pretrained('t5-base')
>>> tokenizer.pad_token
'<pad>'
>>> tokenizer.pad_token_id
0
>>> tokenizer.eos_token
'</s>'
>>> tokenizer.eos_token_id
1
>>> tokenizer.unk_token
'<unk>'
>>> tokenizer.unk_token_id
2
>>> tokenizer.decode([0])
''
>>> tokenizer.decode([1])
''
>>> tokenizer.decode([2])
' ⁇ '

Expected behavior

>>> tokenizer.decode([0])
'<pad>'
>>> tokenizer.decode([1])
'</s>'
>>> tokenizer.decode([2])
'<unk>'

Environment info

  • transformers version: 2.9.1

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 17 (13 by maintainers)

Top GitHub Comments

4 reactions
jsrozner commented on Nov 10, 2020

cc @danyaljj. I'm just going to consolidate the discussion from #7796 here. (Also relevant: the HF forum thread.)

max_src_len above is the maximum length of any input sequence, counted in…wait for it…number of characters. Whoops. That was dumb. I intended to go through and find the maximum sequence length in tokens. I’ll fix that, but I don’t think it affects other things: it turns out that max_src, max_tgt_len = (250, 250) for the inputs I was using. But that just means we had a lot of padding.
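
As an aside, here is a minimal sketch of what counting in tokens rather than characters looks like (the helper name is just illustrative, not something from finetune.py):

    from transformers import T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained('t5-base')

    def max_token_len(texts):
        # longest sequence after tokenization, so length budgets are set in
        # tokens rather than characters
        return max(len(tokenizer.encode(t)) for t in texts)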

I was using finetune.py just last month, so I don’t think it was the EOS token.

The “gibberish” generation still occurs if I just use finetune_t5.sh as written. If I do either of the following, the outputs are correct:

  1. Comment out use_task_specific_params(self.model, "summarization") in finetune.py
  2. Add min_length to the generate call:
        generated_ids = self.model.generate(
            batch["input_ids"],
            attention_mask=batch["attention_mask"],
            use_cache=True,
            decoder_start_token_id=self.decoder_start_token_id,
            num_beams=self.eval_beams,
            max_length=self.eval_max_length,
            # override the min_length=30 injected by the summarization
            # task-specific params, so generation is allowed to stop at EOS
            min_length=0
        )

This is because config.json for t5-small and t5-base has the following (@danyaljj, this is also the answer to our question from the HF forum about where the prefix is getting picked up):

  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },

But it looks like the only param that really mattered was min_length. Beam size, max_length, prefix, etc. weren't causing the problem. I verified this on both the (sent) => (sent) copy task and the (sent) => (first word of sent) task.
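
For context, what use_task_specific_params amounts to is roughly this (a paraphrased sketch, not the exact source): the task block above gets copied onto model.config, which is how min_length=30 sneaks into generation.

    def use_task_specific_params(model, task):
        # copy e.g. config.task_specific_params["summarization"] onto
        # model.config: min_length=30, num_beams=4, prefix="summarize: ", ...
        params = (model.config.task_specific_params or {}).get(task, {})
        model.config.update(params)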

So at least for my use case it seems like the tokenizer decode bug was not causing a problem? It seems like even though we, as ~users, couldn't decode the tokens correctly, the model still knew that 1 == EOS and that after an EOS it should print PAD. The problem was that we were forcing it to generate at least 30 tokens, hence all the gibberish I was seeing.

@sshleifer, does this make sense with your understanding of the finetune script? i.e., that failing to decode EOS shouldn’t matter?

@danyaljj, given that you wanted relatively short outputs of the answers to questions, this seems like it might fix the issue for you? Give it a try and see what happens?

3 reactions
sarahwie commented on Jun 29, 2020

For anyone looking for a quick, temporary fix to the unending-generation problem: override the EOS token with a custom one (note this fix does not work for unk_token or pad_token; for some reason they can’t be re-mapped)

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-base')
tokenizer.add_special_tokens({'eos_token': '[EOS]'})

# `model` is the T5 model being fine-tuned; resize its embedding matrix so it
# covers the newly added token
model.resize_token_embeddings(len(tokenizer))

>>> tokenizer.eos_token_id
32100
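
A quick sanity check (assuming a reasonably recent transformers version; the example sentence is arbitrary) that the remapped token now round-trips through encode/decode, unlike the original </s>:

    ids = tokenizer.encode('some sentence [EOS]')
    print(ids[-1] == tokenizer.eos_token_id)           # expected: True (the new id, 32100 above)
    print(tokenizer.decode([tokenizer.eos_token_id]))  # expected: '[EOS]'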