
ByT5: problem with tokenizer.decode()

See original GitHub issue

Environment info

  • transformers version: 4.11.0
  • Platform: Google Colab
  • Python version: 3.7.12
  • Using GPU in script?: NO
  • Using distributed or parallel set-up in script?: NO

Who can help

  • ByT5: @patrickvonplaten
  • Documentation: @sgugger

Information

Model I am using: google/byt5-small (the problem is the same with google/byt5-base).

To reproduce

See this notebook, which shows the problem when using google/byt5-small from the Hugging Face model hub together with the tokenizer.decode() method under transformers 4.11.0.

The problem does not appear with transformers 4.9.2, for example.

from transformers import T5ForConditionalGeneration, ByT5Tokenizer

model_checkpoint = 'google/byt5-small'
model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)
tokenizer = ByT5Tokenizer.from_pretrained(model_checkpoint)

texts = ["Life is like a box of chocolates.", "Today is Monday."]

for text in texts:
    inputs = tokenizer(text, padding="longest", return_tensors="pt")
    output = model.generate(**inputs)
    print(tokenizer.decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))

Error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-8-6f8451a23561> in <module>()
      6       output[0],
      7       skip_special_tokens=True,
----> 8       clean_up_tokenization_spaces=True
      9       )
     10   )

2 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/byt5/tokenization_byt5.py in convert_tokens_to_string(self, tokens)
    238                 tok_string = bytes([ord(token)])
    239             bstring += tok_string
--> 240         string = bstring.decode("utf-8")
    241         return string
    242 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
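
For context (my illustration, not part of the original report): the byte 0xff can never start a valid UTF-8 sequence, so the decode fails as soon as the model emits it. The decode step fails the same way in isolation:

# 0xff is not a legal UTF-8 start byte, so this raises the same error as above
b"\xff".decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte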

Expected behavior

Two strings as outputs of the ByT5 model.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 18 (14 by maintainers)

Top GitHub Comments

versae commented, Dec 2, 2021 (3 reactions)

Thanks, @Narsil.

It seems to be the case! I re-trained the models and they now work perfectly, with good BLEU and CER scores 😃

Narsil commented, Nov 26, 2021 (1 reaction)

@versae Is it possible that you fed the models Unicode codepoints during training rather than UTF-8 encoded bytes? It looks like it, but I can’t be sure. Since most accented Spanish letters still have codepoints below 255, you might never have hit an issue and been able to train your model just fine.
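
As a quick illustration of the difference (my example, not from the thread): a model trained on codepoints sees one value per character, while ByT5 expects one value per UTF-8 byte, and the two only diverge on non-ASCII characters:

text = "presunción"

# Unicode codepoints: 'ó' is the single value 243 (still below 255)
print([ord(c) for c in text])
# [112, 114, 101, 115, 117, 110, 99, 105, 243, 110]

# UTF-8 bytes: 'ó' becomes the two bytes 195, 179
print(list(text.encode("utf-8")))
# [112, 114, 101, 115, 117, 110, 99, 105, 195, 179, 110]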

Just to make sure, I tested that the ByT5 tokenizer encodes presunción with the correct byte encoding:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')
tokenizer.encode('presunción')
>>> [115, 117, 104, 118, 120, 113, 102, 108, 198, 182, 113, 1]
>>> # 'ó' shows up as the UTF-8 bytes 195 and 179 (here 198 and 182), so it works
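
For context (this is how the ByT5 tokenizer is implemented, not something spelled out in the comment): byte values are shifted up by 3 to leave room for the special token ids <pad>=0, </s>=1 and <unk>=2, which is why 195 and 179 appear above as 198 and 182:

# ByT5 reserves ids 0-2 for <pad>, </s> and <unk>, so each raw byte is offset by 3
raw_bytes = list("presunción".encode("utf-8"))
print([b + 3 for b in raw_bytes] + [1])  # the trailing 1 is the </s> id
# [115, 117, 104, 118, 120, 113, 102, 108, 198, 182, 113, 1]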

If that’s the case, then the good news is that you don’t necessarily need to retrain the model; instead you can override this function with your fix. Something along the lines of:

import types

def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) into a single string."""
    bstring = b""
    for token in tokens:
        if token in self.special_tokens_decoder:
            tok_string = self.special_tokens_decoder[token].encode("utf-8")
        elif token in self.added_tokens_decoder:
            tok_string = self.added_tokens_decoder[token].encode("utf-8")
        else:
            # Encode the token itself as UTF-8 rather than as a single raw
            # byte, so codepoints above 255 survive the round trip
            tok_string = token.encode("utf-8")
        bstring += tok_string
    # Drop byte sequences that are not valid UTF-8 instead of raising
    string = bstring.decode("utf-8", errors="ignore")
    return string

# Bind as a method so `self` refers to the tokenizer instance
tokenizer.convert_tokens_to_string = types.MethodType(convert_tokens_to_string, tokenizer)
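
With the patch applied, the failing decode from the repro above should run without raising (a sketch; the actual generated text depends on the model):

output = model.generate(**tokenizer("Today is Monday.", return_tensors="pt"))
# No longer raises UnicodeDecodeError; invalid byte sequences are dropped
print(tokenizer.decode(output[0], skip_special_tokens=True))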

Keep in mind:

1. This is a dirty hack.
2. It might not be the core of the issue (it could be a mistrained model, or some other error at training time). If it’s not the core issue, this fix might just be hiding the true culprit and lead to more errors downstream.
3. You have now effectively broken your tokenizer, since it won’t encode the same things it decodes.

But it should do the job for your specific purpose.

If you could also provide a link/script showing how you trained, it might give more insight into what went wrong.
