
Text Tokenizer is fragmenting words

See original GitHub issue

I’m running into unexpected behavior from the text tokenizer. I’m running on Windows, Python 3.7, in a virtual environment, using the supplied image_from_text.py script.

The input text is tokenized in a way that breaks up the words, thus preventing the output from actually depicting what was requested:

'a comfy chair that looks like an avocado' ->

tokenizing text
['Ġ', 'a']
['Ġ', 'com', 'fy']
['Ġ', 'chair']
['Ġ', 'th', 'at']
['Ġ', 'look', 's']
['Ġ', 'like']
['Ġ', 'an']
['Ġ', 'av', 'oc', 'ado']
text tokens [0, 3, 28, 3, 157, 10065, 3, 10022, 3, 184, 73, 3, 7003, 46, 3, 19831, 3, 65, 3, 178, 158, 1165, 2]

'alien life' ->

tokenizing text
['Ġ', 'al', 'ien']
['Ġ', 'life']
text tokens [0, 3, 71, 1385, 3, 3210, 2]

Since the wrong tokens were chosen, the model returns a generic gamer chair for the first prompt and what looks like a petri dish for the second, which is expected given the garbled tokens.
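For comparison, here is what non-fragmented byte-level BPE output looks like. This sketch uses GPT-2’s tokenizer from the transformers library purely to illustrate the ‘Ġ’ convention; it is not the tokenizer shipped with this repo:

```python
from transformers import GPT2TokenizerFast

# GPT-2's byte-level BPE uses the same "Ġ" convention: the marker stands for a
# leading space and stays fused to the word it introduces.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

print(tokenizer.tokenize("a comfy chair that looks like an avocado"))
# Healthy output keeps words intact (e.g. 'Ġchair', 'Ġlike'), splitting only
# rarer words into a couple of subword pieces, never into bare characters.
```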

I checked that the tokenizer.json files were downloaded correctly for both the mini and mega models, and they were: manually searching the files for the words finds them without any issue.
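One Windows-specific cause worth ruling out (my assumption, not something confirmed in this thread): Python’s open() without an explicit encoding falls back to the locale code page on Windows (often cp1252), so a UTF-8 vocab file can be decoded into slightly different text, and multi-byte entries like ‘Ġ’ no longer match during the BPE merge step. A minimal check, with a placeholder path:

```python
# Sketch of an encoding sanity check; the path below is a placeholder and
# should point at wherever the tokenizer.json was downloaded.
path = "pretrained/tokenizer.json"

with open(path) as f:                    # platform default encoding (cp1252 on many Windows installs)
    default_text = f.read()
with open(path, encoding="utf-8") as f:  # the encoding the file is actually written in
    utf8_text = f.read()

# If these differ (or the first read raises UnicodeDecodeError), characters such
# as "Ġ" were mangled on load, which would leave the merges unable to match
# whole words and could produce exactly this kind of fragmented output.
print("reads identical:", default_text == utf8_text)
```

If the two reads differ, passing encoding="utf-8" wherever the tokenizer files are opened is the usual remedy.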

Is there a specific dependency for the text tokenizer that I’m unaware of, or is this simply a bug?

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 13 (5 by maintainers)

Top GitHub Comments

1 reaction
Kreevoz commented, Jun 29, 2022

Yep, just freshly cloned your repo to make sure it’s all okay, and it is. Windows users may rejoice now!

0 reactions
kuprel commented, Jun 29, 2022

Awesome thanks. I just updated it. Does it work now?

