ByT5: problem with tokenizer.decode()
Environment info
- transformers version: 4.11.0
- Platform: Google Colab
- Python version: 3.7.12
- Using GPU in script?: NO
- Using distributed or parallel set-up in script?: NO
Who can help
ByT5: @patrickvonplaten
Documentation: @sgugger
Information
Model I am using: google/byt5-small (the problem is the same with google/byt5-base).
To reproduce
See this notebook, which shows the problem when using google/byt5-small from the Hugging Face model hub and the tokenizer.decode() method with transformers version 4.11.0. The problem does not appear with transformers version 4.9.2, for example.
from transformers import T5ForConditionalGeneration, ByT5Tokenizer

model_checkpoint = 'google/byt5-small'
model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)
tokenizer = ByT5Tokenizer.from_pretrained(model_checkpoint)

texts = ["Life is like a box of chocolates.", "Today is Monday."]

for text in texts:
    inputs = tokenizer(text, padding="longest", return_tensors="pt")
    output = model.generate(**inputs)
    # Raises UnicodeDecodeError under transformers 4.11.0:
    print(tokenizer.decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
Error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-8-6f8451a23561> in <module>()
6 output[0],
7 skip_special_tokens=True,
----> 8 clean_up_tokenization_spaces=True
9 )
10 )
/usr/local/lib/python3.7/dist-packages/transformers/models/byt5/tokenization_byt5.py in convert_tokens_to_string(self, tokens)
238 tok_string = bytes([ord(token)])
239 bstring += tok_string
--> 240 string = bstring.decode("utf-8")
241 return string
242
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
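For reference, the failing code shown in the traceback turns each generated token back into a single byte and then strictly UTF-8 decodes the accumulated byte string, so any generated sequence that is not valid UTF-8 crashes. A minimal illustration outside transformers:

# 0xff can never start a valid UTF-8 sequence, which is exactly
# what the traceback above reports.
bytes([0xff]).decode("utf-8")                   # raises UnicodeDecodeError
bytes([0xff]).decode("utf-8", errors="ignore")  # '' (a tolerant decode drops the byte)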
Expected behavior
Two strings as the outputs of ByT5.
Issue Analytics
- Created: 2 years ago
- Comments: 18 (14 by maintainers)
Thanks, @Narsil.
It seems to be the case! I re-trained the models and they work perfectly fine now, and with good BLEU and CER scores 😃
@versae Is it possible that you fed the models with unicode codepoints during training and not utf-8 encoded bytes ? This looks like it, but I can’t be sure. Since I think most accented spanish letters are still below 255 you might not have encountered any issue and been able to train your model just fine.
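To make the distinction concrete (a sketch of my own, not from the original comment): any character above U+007F occupies several UTF-8 bytes but only one codepoint, so the two input representations diverge exactly on accented letters.

text = "presunción"
# UTF-8 bytes: "ó" encodes as the two bytes 0xC3 0xB3.
print(list(text.encode("utf-8")))  # [..., 195, 179, 110]
# Unicode codepoints: "ó" is the single codepoint 243 (still < 256).
print([ord(c) for c in text])      # [..., 243, 110]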
Just to make sure, I tested that the byt5 tokenizer encodes presunción with the correct (UTF-8 byte) encoding. If that's the case, then the good news is that you don't necessarily need to retrain the model; instead, you may be able to override this function with your fix. Something along the lines of:
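(The snippet from the original comment was lost in extraction; the following is a minimal reconstruction, assuming the model was trained on Unicode codepoints. The function name and the monkey-patching approach are mine, not necessarily what was originally posted.)

from transformers import ByT5Tokenizer

tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")

def codepoint_convert_tokens_to_string(tokens):
    # Dirty hack: treat each byte token as a Unicode codepoint and
    # concatenate directly, instead of packing the tokens into a byte
    # string and UTF-8 decoding it as the stock method does.
    return "".join(t for t in tokens if t not in tokenizer.all_special_tokens)

# Shadow the instance method so tokenizer.decode() picks up the hack.
tokenizer.convert_tokens_to_string = codepoint_convert_tokens_to_string

With this in place, tokenizer.decode(output[0], skip_special_tokens=True) should return the codepoint reading of the model's output instead of raising.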
Keep in mind:
1- This is a dirty hack.
2- It might not be the core of the issue (it could be a mistrained model, or some other error at training time). If it's not the core issue, this fix might just be hiding the true culprit and leading to more errors downstream.
3- You have now effectively broken your tokenizer, since it won't encode the same things it decodes (see the round-trip sketch below).
But it should do the job for your specific purpose.
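To make caveat 3- concrete (my own illustration): encoding still goes through UTF-8 bytes, but the hacked decode reads those bytes back as codepoints, so a round trip no longer returns the original text.

# encode: "ó" -> UTF-8 bytes 0xC3 0xB3 -> byte tokens chr(195), chr(179)
# hacked decode: chr(195) + chr(179) == "Ã³", not "ó"
assert "ó".encode("utf-8") == bytes([195, 179])
assert "".join(chr(b) for b in "ó".encode("utf-8")) == "Ã³"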
If you could also provide a link or script showing how you trained, it might give more insight into what went wrong.