Tokenizer's nb_words not taking effect when fit_on_texts is run
Hello, I can't figure out if I'm doing something wrong, but Tokenizer always seems to ignore the nb_words parameter I provide it and tokenize ALL words rather than just the top nb_words. I'm running Python 2.7. Note that I ran into this issue while working on a dataset with over 1000 unique words, and any lower value I set for nb_words (e.g. 10, 100, 500...) was ignored. Below is a simple example to illustrate what I'm seeing. Thank you.
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(nb_words=10)
tokenizer.fit_on_texts(['apple book car dog egg fries girl ham inside jam knife leg monkey nod open pear question rough stone tree umbrella voice wax xylophone year zoo'])
print(len(tokenizer.word_index))
# comes out as 26 rather than 10
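This is the documented behavior: fit_on_texts records every word it sees, and the nb_words (num_words in newer Keras versions) cap is only applied later, when texts are converted to sequences or matrices. The sketch below is a minimal pure-Python illustration of that split (it is not the Keras source; MiniTokenizer and its internals are made up for illustration):

```python
from collections import Counter

# Minimal sketch of the behavior described above (NOT the Keras source):
# fitting records EVERY word; the num_words cap is applied later, when
# texts are converted to sequences.
class MiniTokenizer:
    def __init__(self, num_words=None):
        self.num_words = num_words
        self.word_index = {}  # word -> rank (1-based), for ALL words

    def fit_on_texts(self, texts):
        counts = Counter(w for t in texts for w in t.lower().split())
        # most frequent word gets index 1, the next gets 2, and so on
        for rank, (word, _) in enumerate(counts.most_common(), start=1):
            self.word_index[word] = rank

    def texts_to_sequences(self, texts):
        # the cap is enforced HERE: only indices < num_words survive
        out = []
        for t in texts:
            seq = []
            for w in t.lower().split():
                i = self.word_index.get(w)
                if i is not None and (self.num_words is None or i < self.num_words):
                    seq.append(i)
            out.append(seq)
        return out

tok = MiniTokenizer(num_words=10)
tok.fit_on_texts(['apple book car dog egg fries girl ham inside jam '
                  'knife leg monkey nod open pear question rough stone tree '
                  'umbrella voice wax xylophone year zoo'])
print(len(tok.word_index))                         # 26 -- full vocabulary kept
print(tok.texts_to_sequences(['apple zoo book']))  # [[1, 2]] -- zoo (26) dropped
```

So len(word_index) reporting 26 does not mean nb_words was ignored; calling texts_to_sequences or texts_to_matrix on the real Tokenizer is where the limit shows up.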
Issue Analytics
- Created: 7 years ago
- Comments: 13
Top Results From Across the Web
- Keras Tokenizer num_words doesn't seem to work: Limiting num_words to a small number (e.g., 3) has no effect on fit_on_texts outputs such as word_index, word_counts ...
- tf.keras.preprocessing.text.Tokenizer | TensorFlow v2.11.0: Transforms each text in texts to a sequence of integers. Only the top num_words-1 most frequent words will be taken into account. Only words...
- Understanding the effect of num_words of Tokenizer in Keras: When I run this, it prints: Found 88582 unique words. My question is, isn't num_words the parameter that controls the number of words ...
- THE INTELLIGENT DEGREE PLANNER - ScholarWorks: In order to execute any machine learning task, data needs to be cleaned, as we do not want invalid data. Invalid data can...
- Analysis of Twitter Data Using Deep Learning Approach: LSTM: Nowadays the growth of social websites, blogging services, and electronic media contributes big ... tokenizer.fit_on_texts(data['text'].values).
Is this behavior intended? If yes, it would be very useful to have a parameter to only get the word_index for the top nb_words.
@shafy I mean you can look each word up in word_index and check whether its mapped index is under nb_words. It's not a hard task. If the word_index dictionary is large enough, you may want to generate a filtered version up front rather than doing the check on the fly.
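The suggestion above can be sketched as a one-off dictionary comprehension. The word_index below is a hypothetical stand-in for tokenizer.word_index (Keras builds it with 1-based ranks, most frequent word first, and keeps only indices strictly below num_words when converting texts):

```python
# Sketch: precompute a filtered word_index so lookups need not check the
# cap on every call. `word_index` is a stand-in for tokenizer.word_index.
word_index = {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6}
num_words = 4  # only indices strictly below num_words are kept

filtered_index = {w: i for w, i in word_index.items() if i < num_words}
print(filtered_index)  # {'the': 1, 'cat': 2, 'sat': 3}
```

Note that because ranks start at 1 and the comparison is strict, the filtered dictionary holds the top num_words-1 words, matching the TensorFlow documentation quoted above.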