
Tokenizer's nb_words not taking effect when fit_on_texts is run

See original GitHub issue

Hello, I can’t figure out if I’m doing something wrong, but Tokenizer always seems to ignore the nb_words parameter I provide it and tokenize ALL words rather than just the top nb_words. I’m running Python 2.7. Note that I ran into this issue while working on a dataset with over 1000 unique words, and any lower value I set for nb_words (e.g. 10, 100, 500…) was ignored. Below is a simple example to illustrate quickly what I’m getting. Thank you.

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(nb_words=10)
tokenizer.fit_on_texts(['apple book car dog egg fries girl ham inside jam knife leg monkey nod open pear question rough stone tree umbrella voice wax xylophone year zoo'])
print(len(tokenizer.word_index))
# comes out as 26 rather than 10
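
For readers landing here from search: in Keras 2 this argument was renamed num_words, and (as the results below confirm) the limit is applied when texts are converted to sequences, not when word_index is built, so the count above is expected behavior. A minimal sketch of where the limit does kick in, assuming current Keras; exact index values can vary by version since ties in word counts are broken by insertion order:

# fit_on_texts ignores num_words; texts_to_sequences enforces it
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(['apple book car dog egg fries girl ham inside jam '
                        'knife leg monkey nod open pear question rough stone '
                        'tree umbrella voice wax xylophone year zoo'])
print(len(tokenizer.word_index))    # still 26: the full vocabulary is indexed
print(tokenizer.texts_to_sequences(['apple book zoo']))
# [[1, 2]]: 'zoo' falls outside the top num_words and is dropped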

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 13

Top GitHub Comments

2 reactions
shafy commented, Nov 1, 2017

Is this behavior intended? If so, it would be very useful to have a parameter to get the word_index for only the top nb_words.

0 reactions
hsiaoyi0504 commented, Dec 14, 2017

@shafy I mean you can look a word up in word_index and check whether its index is below nb_words. It’s not a hard task. If the word_index dictionary is large, you may want to generate a filtered version rather than doing this on the fly.
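
A minimal sketch of that filtering workaround, assuming the Keras 2 spelling num_words (nb_words in Keras 1):

from keras.preprocessing.text import Tokenizer

num_words = 10
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(['apple book car dog egg fries girl ham inside jam knife leg'])

# Indices start at 1 and texts_to_sequences keeps only indices < num_words,
# so this filtered copy matches the vocabulary the tokenizer actually uses.
filtered_index = {word: i for word, i in tokenizer.word_index.items()
                  if i < num_words}
print(len(filtered_index))  # 9 entries: the top num_words - 1 words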

Read more comments on GitHub >

Top Results From Across the Web

Keras Tokenizer num_words doesn't seem to work
Limiting num_words to a small number (e.g., 3) has no effect on fit_on_texts outputs such as word_index, word_counts ...
Read more >
tf.keras.preprocessing.text.Tokenizer | TensorFlow v2.11.0
Transforms each text in texts to a sequence of integers. Only top num_words-1 most frequent words will be taken into account. Only words...
Read more >
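
The "num_words-1" wording above is easy to miss: indices start at 1 and the cutoff is index < num_words, so only num_words-1 distinct words survive conversion. A small sketch of the boundary, assuming current Keras:

from keras.preprocessing.text import Tokenizer

t = Tokenizer(num_words=3)
t.fit_on_texts(['a a a b b c'])        # counts: a=3, b=2, c=1 -> indices 1, 2, 3
print(t.texts_to_sequences(['a b c'])) # [[1, 2]]: index 3 is not < num_words
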
Understanding the effect of num_words of Tokenizer in Keras
When I run this, it prints: Found 88582 unique words. My question is, isn't num_words the parameter that controls the number of words ...
Read more >
THE INTELLIGENT DEGREE PLANNER - ScholarWorks
In order to execute any machine learning task, data needs to be cleaned as we do not want invalid data. Invalid data can...
Read more >
Analysis of Twitter Data Using Deep Learning Approach: Lstm
Nowadays the growth of social websites, blogging services and electronic media contributes big ... tokenizer.fit_on_texts(data['text'].values).
Read more >
