ByT5: problem with tokenizer.decode()
Environment info
- transformers version: 4.11.0
- Platform: Google Colab
- Python version: 3.7.12
- Using GPU in script?: NO
- Using distributed or parallel set-up in script?: NO
Who can help
ByT5: @patrickvonplaten
Documentation: @sgugger
Information
Model I am using: google/byt5-small (the problem is the same with google/byt5-base).
To reproduce
See this notebook, which shows the problem when using google/byt5-small from the Hugging Face model hub and the tokenizer.decode() method with transformers version 4.11.0. The problem does not appear with transformers version 4.9.2, for example.
from transformers import T5ForConditionalGeneration, ByT5Tokenizer

model_checkpoint = 'google/byt5-small'
model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)
tokenizer = ByT5Tokenizer.from_pretrained(model_checkpoint)

texts = ["Life is like a box of chocolates.", "Today is Monday."]

for text in texts:
    inputs = tokenizer(text, padding="longest", return_tensors="pt")
    output = model.generate(**inputs)
    # Raises UnicodeDecodeError under transformers 4.11.0:
    print(tokenizer.decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
Error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-8-6f8451a23561> in <module>()
6 output[0],
7 skip_special_tokens=True,
----> 8 clean_up_tokenization_spaces=True
9 )
10 )
/usr/local/lib/python3.7/dist-packages/transformers/models/byt5/tokenization_byt5.py in convert_tokens_to_string(self, tokens)
238 tok_string = bytes([ord(token)])
239 bstring += tok_string
--> 240 string = bstring.decode("utf-8")
241 return string
242
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
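For reference, the failing code shown in the traceback turns each generated token back into a single byte and then strictly UTF-8 decodes the accumulated byte string, so any generated sequence that is not valid UTF-8 crashes. A minimal illustration outside transformers:

# 0xff can never start a valid UTF-8 sequence, which is exactly
# what the traceback above reports.
bytes([0xff]).decode("utf-8")                   # raises UnicodeDecodeError
bytes([0xff]).decode("utf-8", errors="ignore")  # '' (a tolerant decode drops the byte)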
Expected behavior
Two strings as the outputs of ByT5.
Issue Analytics
- Created: 2 years ago
- Comments: 18 (14 by maintainers)
Thanks, @Narsil.
It seems to be the case! I re-trained the models and they work perfectly fine now, and with good BLEU and CER scores 😃
@versae Is it possible that you fed the models with unicode codepoints during training and not utf-8 encoded bytes ? This looks like it, but I can’t be sure. Since I think most accented spanish letters are still below 255 you might not have encountered any issue and been able to train your model just fine.
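To make the distinction concrete (a sketch of my own, not from the original comment): any character above U+007F occupies several UTF-8 bytes but only one codepoint, so the two input representations diverge exactly on accented letters.

text = "presunción"
# UTF-8 bytes: "ó" encodes as the two bytes 0xC3 0xB3.
print(list(text.encode("utf-8")))  # [..., 195, 179, 110]
# Unicode codepoints: "ó" is the single codepoint 243 (still < 256).
print([ord(c) for c in text])      # [..., 243, 110]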
Just to make sure, I tested that the byt5 tokenizer encodes presunción with the correct (UTF-8 byte) encoding. If that's the case, then the good news is that you don't necessarily need to retrain the model; instead, you may be able to override this function with your fix. Something along the lines of:
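(The snippet from the original comment was lost in extraction; the following is a minimal reconstruction, assuming the model was trained on Unicode codepoints. The function name and the monkey-patching approach are mine, not necessarily what was originally posted.)

from transformers import ByT5Tokenizer

tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")

def codepoint_convert_tokens_to_string(tokens):
    # Dirty hack: treat each byte token as a Unicode codepoint and
    # concatenate directly, instead of packing the tokens into a byte
    # string and UTF-8 decoding it as the stock method does.
    return "".join(t for t in tokens if t not in tokenizer.all_special_tokens)

# Shadow the instance method so tokenizer.decode() picks up the hack.
tokenizer.convert_tokens_to_string = codepoint_convert_tokens_to_string

With this in place, tokenizer.decode(output[0], skip_special_tokens=True) should return the codepoint reading of the model's output instead of raising.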
Keep in mind:
1- This is a dirty hack.
2- It might not be the core of the issue (it could be a mistrained model, or some other error at training time). If it's not the core issue, this fix might just be hiding the true culprit and leading to more errors downstream.
3- You have now effectively broken your tokenizer, since it won't encode the same things it decodes (see the round-trip sketch below).
But it should do the job for your specific purpose.
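To make caveat 3- concrete (my own illustration): encoding still goes through UTF-8 bytes, but the hacked decode reads those bytes back as codepoints, so a round trip no longer returns the original text.

# encode: "ó" -> UTF-8 bytes 0xC3 0xB3 -> byte tokens chr(195), chr(179)
# hacked decode: chr(195) + chr(179) == "Ã³", not "ó"
assert "ó".encode("utf-8") == bytes([195, 179])
assert "".join(chr(b) for b in "ó".encode("utf-8")) == "Ã³"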
If you could also provide a link or script showing how you trained, it might give more insight into what went wrong.