
Text Tokenizer is fragmenting words

See original GitHub issue

I’m running into unexpected behavior from the text tokenizer. I’m running on Windows, Python 3.7, in a virtual environment, using the supplied image_from_text.py script.

The input text is tokenized in a way that breaks up the words, thus preventing the output from actually depicting what was requested:

'a comfy chair that looks like an avocado' ->

tokenizing text
['Ġ', 'a']
['Ġ', 'com', 'fy']
['Ġ', 'chair']
['Ġ', 'th', 'at']
['Ġ', 'look', 's']
['Ġ', 'like']
['Ġ', 'an']
['Ġ', 'av', 'oc', 'ado']
text tokens [0, 3, 28, 3, 157, 10065, 3, 10022, 3, 184, 73, 3, 7003, 46, 3, 19831, 3, 65, 3, 178, 158, 1165, 2]

'alien life' ->

tokenizing text
['Ġ', 'al', 'ien']
['Ġ', 'life']
text tokens [0, 3, 71, 1385, 3, 3210, 2]

Since the wrong tokens were chosen, the model returns a generic gamer chair for the first prompt and what looks like a petri dish for the second, which is expected given the garbled tokens.
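For comparison, here is what non-fragmented byte-level BPE output looks like. This sketch uses GPT-2’s tokenizer from the transformers library purely to illustrate the ‘Ġ’ convention; it is not the tokenizer shipped with this repo:

```python
from transformers import GPT2TokenizerFast

# GPT-2's byte-level BPE uses the same "Ġ" convention: the marker stands for a
# leading space and stays fused to the word it introduces.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

print(tokenizer.tokenize("a comfy chair that looks like an avocado"))
# Healthy output keeps words intact (e.g. 'Ġchair', 'Ġlike'), splitting only
# rarer words into a couple of subword pieces, never into bare characters.
```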

I checked that the tokenizer.json files were downloaded correctly for both the mini and mega models, and they were: manually searching the files for the words finds them without any issue.
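One Windows-specific cause worth ruling out (my assumption, not something confirmed in this thread): Python’s open() without an explicit encoding falls back to the locale code page on Windows (often cp1252), so a UTF-8 vocab file can be decoded into slightly different text, and multi-byte entries like ‘Ġ’ no longer match during the BPE merge step. A minimal check, with a placeholder path:

```python
# Sketch of an encoding sanity check; the path below is a placeholder and
# should point at wherever the tokenizer.json was downloaded.
path = "pretrained/tokenizer.json"

with open(path) as f:                    # platform default encoding (cp1252 on many Windows installs)
    default_text = f.read()
with open(path, encoding="utf-8") as f:  # the encoding the file is actually written in
    utf8_text = f.read()

# If these differ (or the first read raises UnicodeDecodeError), characters such
# as "Ġ" were mangled on load, which would leave the merges unable to match
# whole words and could produce exactly this kind of fragmented output.
print("reads identical:", default_text == utf8_text)
```

If the two reads differ, passing encoding="utf-8" wherever the tokenizer files are opened is the usual remedy.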

Is there a specific dependency for the text tokenizer that I’m unaware of, or is this simply a bug?

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 13 (5 by maintainers)

Top GitHub Comments

1 reaction
Kreevoz commented, Jun 29, 2022

Yep, just freshly cloned your repo to make sure it’s all okay, and it is. Windows users may rejoice now!

0 reactions
kuprel commented, Jun 29, 2022

Awesome thanks. I just updated it. Does it work now?

