Tokenizer's nb_words not taking effect when fit_on_texts is run
Hello, I can't figure out if I'm doing something wrong, but Tokenizer always seems to ignore the nb_words parameter I provide it and tokenize ALL words rather than just the top nb_words. I'm running Python 2.7. Note that I ran into this issue while working on a dataset with over 1000 unique words, and any lower value I set for nb_words (e.g. 10, 100, 500...) was ignored. Below is a simple example to illustrate what I'm seeing. Thank you.
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(nb_words=10)
tokenizer.fit_on_texts(['apple book car dog egg fries girl ham inside jam knife leg monkey nod open pear question rough stone tree umbrella voice wax xylophone year zoo'])
print(len(tokenizer.word_index))
# comes out as 26 rather than 10
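This is the documented behavior: fit_on_texts records every word it sees, and the nb_words (num_words in newer Keras versions) cap is only applied later, when texts are converted to sequences or matrices. The sketch below is a minimal pure-Python illustration of that split (it is not the Keras source; MiniTokenizer and its internals are made up for illustration):

```python
from collections import Counter

# Minimal sketch of the behavior described above (NOT the Keras source):
# fitting records EVERY word; the num_words cap is applied later, when
# texts are converted to sequences.
class MiniTokenizer:
    def __init__(self, num_words=None):
        self.num_words = num_words
        self.word_index = {}  # word -> rank (1-based), for ALL words

    def fit_on_texts(self, texts):
        counts = Counter(w for t in texts for w in t.lower().split())
        # most frequent word gets index 1, the next gets 2, and so on
        for rank, (word, _) in enumerate(counts.most_common(), start=1):
            self.word_index[word] = rank

    def texts_to_sequences(self, texts):
        # the cap is enforced HERE: only indices < num_words survive
        out = []
        for t in texts:
            seq = []
            for w in t.lower().split():
                i = self.word_index.get(w)
                if i is not None and (self.num_words is None or i < self.num_words):
                    seq.append(i)
            out.append(seq)
        return out

tok = MiniTokenizer(num_words=10)
tok.fit_on_texts(['apple book car dog egg fries girl ham inside jam '
                  'knife leg monkey nod open pear question rough stone tree '
                  'umbrella voice wax xylophone year zoo'])
print(len(tok.word_index))                         # 26 -- full vocabulary kept
print(tok.texts_to_sequences(['apple zoo book']))  # [[1, 2]] -- zoo (26) dropped
```

So len(word_index) reporting 26 does not mean nb_words was ignored; calling texts_to_sequences or texts_to_matrix on the real Tokenizer is where the limit shows up.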
Issue Analytics
- Created: 7 years ago
- Comments: 13
Top Results From Across the Web
- Keras Tokenizer num_words doesn't seem to work: Limiting num_words to a small number (e.g., 3) has no effect on fit_on_texts outputs such as word_index, word_counts ...
- tf.keras.preprocessing.text.Tokenizer | TensorFlow v2.11.0: Transforms each text in texts to a sequence of integers. Only the top num_words-1 most frequent words will be taken into account. Only words...
- Understanding the effect of num_words of Tokenizer in Keras: When I run this, it prints: Found 88582 unique words. My question is, isn't num_words the parameter that controls the number of words ...
- THE INTELLIGENT DEGREE PLANNER - ScholarWorks: In order to execute any machine learning task, data needs to be cleaned, as we do not want invalid data. Invalid data can...
- Analysis of Twitter Data Using Deep Learning Approach: LSTM: Nowadays the growth of social websites, blogging services, and electronic media contributes big ... tokenizer.fit_on_texts(data['text'].values).
Is this behavior intended? If yes, it would be very useful to have a parameter to only get the word_index for the top nb_words.
@shafy I mean you can look each word up in word_index and check whether its mapped index is under nb_words. It's not a hard task. If the word_index dictionary is large enough, you may want to generate a filtered version up front rather than doing the check on the fly.
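The suggestion above can be sketched as a one-off dictionary comprehension. The word_index below is a hypothetical stand-in for tokenizer.word_index (Keras builds it with 1-based ranks, most frequent word first, and keeps only indices strictly below num_words when converting texts):

```python
# Sketch: precompute a filtered word_index so lookups need not check the
# cap on every call. `word_index` is a stand-in for tokenizer.word_index.
word_index = {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6}
num_words = 4  # only indices strictly below num_words are kept

filtered_index = {w: i for w, i in word_index.items() if i < num_words}
print(filtered_index)  # {'the': 1, 'cat': 2, 'sat': 3}
```

Note that because ranks start at 1 and the comparison is strict, the filtered dictionary holds the top num_words-1 words, matching the TensorFlow documentation quoted above.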