Text Tokenizer is fragmenting words
I’m running into unexpected behavior of the text tokenizer on Windows, Python 3.7, in a virtual environment, using the supplied image_from_text.py script.
The input text is tokenized in a way that breaks up the words, thus preventing the output from actually depicting what was requested:
The prompt 'a comfy chair that looks like an avocado' produces:

```
tokenizing text
['Ġ', 'a']
['Ġ', 'com', 'fy']
['Ġ', 'chair']
['Ġ', 'th', 'at']
['Ġ', 'look', 's']
['Ġ', 'like']
['Ġ', 'an']
['Ġ', 'av', 'oc', 'ado']
text tokens [0, 3, 28, 3, 157, 10065, 3, 10022, 3, 184, 73, 3, 7003, 46, 3, 19831, 3, 65, 3, 178, 158, 1165, 2]
```
The prompt 'alien life' produces:

```
tokenizing text
['Ġ', 'al', 'ien']
['Ġ', 'life']
text tokens [0, 3, 71, 1385, 3, 3210, 2]
```
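As a cross-check independent of this repo's tokenizer code, here is a minimal sketch that encodes the same prompt directly from the downloaded vocabulary file. It assumes the tokenizer.json is a standard Hugging Face `tokenizers` file, and the path is hypothetical:

```python
# Minimal cross-check, assuming tokenizer.json is a standard Hugging Face
# tokenizers file; the path below is hypothetical -- adjust to your setup.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("pretrained/tokenizer.json")
encoding = tokenizer.encode("a comfy chair that looks like an avocado")
print(encoding.tokens)
# With an intact vocabulary one would expect whole-word pieces such as
# 'Ġchair' and 'Ġavocado' rather than the fragments shown above.
```

If this prints whole words while the repo's own tokenizer fragments them, the vocabulary file itself is fine and the bug is in how the repo loads or applies it.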
Because the wrong tokens are chosen, the model returns a generic gamer chair for the first prompt and something resembling a petri dish for the second, which is unsurprising given the garbled input.
I checked that the tokenizer.json files were downloaded correctly for both the mini and mega models, and they were: manually searching the files for the missing words finds them without any issue.
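For reference, here is a quick way to sanity-check the vocabulary programmatically rather than by eye. The path is hypothetical, and the location of the vocabulary inside the file is an assumption (Hugging Face-style files nest it under "model"/"vocab"; plain vocab files are a flat mapping):

```python
import json

# Note the explicit encoding, which matters on Windows.
with open("pretrained/tokenizer.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Assumed layout: HF-style files nest the vocab; flat vocab files
# are the mapping itself.
vocab = data.get("model", {}).get("vocab", data)

for word in ("Ġchair", "Ġavocado", "Ġalien", "Ġlife"):
    print(word, word in vocab)  # expect True for all four with an intact file
```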
Is there a specific dependency for the text tokenizer that I’m unaware of, or is this simply a bug?
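One plausible explanation, offered here as an assumption rather than a confirmed diagnosis: on Windows, open() without an explicit encoding decodes files with the locale code page (often cp1252) instead of UTF-8, which mangles the multi-byte 'Ġ' word-boundary marker and corrupts the vocabulary, leaving only short fragment merges. A minimal sketch of that failure mode (hypothetical path):

```python
path = "pretrained/tokenizer.json"  # hypothetical path

with open(path, "r", encoding="utf-8") as f:    # correct: the file is UTF-8
    correct = f.read()

# Simulate a Windows locale default; errors="replace" keeps the demo from
# crashing on bytes that cp1252 cannot decode.
with open(path, "r", encoding="cp1252", errors="replace") as f:
    garbled = f.read()                          # 'Ġ' (0xC4 0xA0) -> 'Ä\xa0'

print("Ġavocado" in correct)  # True with an intact vocabulary
print("Ġavocado" in garbled)  # False: the marker bytes were misdecoded
```

If that is the cause, passing encoding="utf-8" to every open() that reads the tokenizer files would fix it.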
Issue Analytics
- Created a year ago
- Comments: 13 (5 by maintainers)
Top GitHub Comments
Yep, just freshly cloned your repo to make sure it’s all okay, and it is. Windows users may rejoice now!
Awesome thanks. I just updated it. Does it work now?