T5 special tokens not mapped to unique indices in vocabulary
See original GitHub issue

The docs recommend adding the special eos_token </s> to the end of each string when encoding/decoding with T5Tokenizer. However, this and the other special tokens (e.g. unk_token, pad_token) aren't assigned unique ids in the lookup vocabulary: they are mapped to {0, 1, 2}, which are indices for other common words in the vocab. In practice, I find that my model fails to properly produce the eos_token, since it is associated with blank spaces, so the model produces run-ons during generation.
To reproduce
>>> from transformers import T5Tokenizer
>>> tokenizer = T5Tokenizer.from_pretrained('t5-base')
>>> tokenizer.pad_token
'<pad>'
>>> tokenizer.pad_token_id
0
>>> tokenizer.eos_token
'</s>'
>>> tokenizer.eos_token_id
1
>>> tokenizer.unk_token
'<unk>'
>>> tokenizer.unk_token_id
2
>>> tokenizer.decode([0])
''
>>> tokenizer.decode([1])
''
>>> tokenizer.decode([2])
' ⁇ '
Expected behavior
>>> tokenizer.decode([0])
'<pad>'
>>> tokenizer.decode([1])
'</s>'
>>> tokenizer.decode([2])
'<unk>'
Environment info
transformers version: 2.9.1
Top GitHub Comments
cc @danyaljj I’m just going to consolidate discussion from (#7796) here. (Also relevant is the HF forum thread.)

max_src_len above is the maximum length of any input sequence, counted in…wait for it…number of characters. Whoops. That was dumb. I intended to go through and find the maximum sequence length in tokens. I’ll fix that, but I don’t think it affects other things: it turns out that max_src_len, max_tgt_len = (250, 250) for the inputs I was using. But that just means we had a lot of padding. I was using finetune.py just last month, so I don’t think it was the EOS token.
The “gibberish” generation still occurs if I just use finetune_t5.sh as written. If I do either of the following, the outputs are correct:
- call use_task_specific_params(self.model, "summarization") in finetune.py
- pass the equivalent task-specific parameters directly in the generate call

This is because config.json for t5-small and t5-base has the following task_specific_params (@danyaljj this is also the answer to our question about where prefix is getting picked up in the HF forum).
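To see where those defaults come from, here is a minimal sketch of inspecting them (assuming a transformers version where T5Config.from_pretrained and task_specific_params behave as on recent releases):

from transformers import T5Config

# Print the summarization defaults shipped in t5-small's config.json;
# the use_task_specific_params helper mentioned above copies these onto the model config.
config = T5Config.from_pretrained("t5-small")
print(config.task_specific_params["summarization"])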
But it looks like the only param that really mattered was min_length. Beam size, max_length, prefix, etc. weren’t causing the problem. I verified this on both the (sent) => (sent) copy and the (sent) => (first word of sent) tasks.
So at least for my use case it seems like the tokenizer decode bug was not causing a problem? It seems like even though we, as users, couldn’t decode the tokens correctly, the model still knew that 1 == EOS and that after an EOS it should print PAD. The problem was that we were forcing it to generate at least 30 tokens (see the sketch after this comment), hence all the gibberish that I was seeing.
@sshleifer, does this make sense with your understanding of the finetune script? i.e., that failing to decode EOS shouldn’t matter?
@danyaljj, given that you wanted relatively short outputs of the answers to questions, this seems like it might fix the issue for you? Give it a try and see what happens?
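What the fix amounts to, as a minimal sketch of overriding those defaults at generation time (the parameter values here are illustrative, not the exact ones from finetune.py):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer.encode("The house is wonderful.", return_tensors="pt")
output_ids = model.generate(
    input_ids,
    max_length=50,
    min_length=0,          # the summarization defaults force min_length=30, which caused the run-ons
    num_beams=4,
    early_stopping=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))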
For anyone looking for a quick, temporary fix to the unending-generation problem: override the EOS token with a custom one (note this fix does not work for unk_token or pad_token; for some reason they can’t be re-mapped).
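A minimal sketch of that workaround; '<custom_eos>' is an illustrative token name, and the resize step assumes you are also fine-tuning the model:

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Register a brand-new EOS token; it gets a fresh id at the end of the vocab.
tokenizer.add_special_tokens({"eos_token": "<custom_eos>"})
model.resize_token_embeddings(len(tokenizer))

# The new token survives decoding, unlike the original id 1.
print(tokenizer.eos_token_id, tokenizer.decode([tokenizer.eos_token_id]))
# Training targets need to end with '<custom_eos>' so the model learns to emit it.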