Add a `max_words` argument to `build_vocab_from_iterator`
🚀 Feature
I believe it would be beneficial to be able to limit the number of words in the vocabulary with an argument like max_words, e.g.:
vocab = build_vocab_from_iterator(yield_tokens_batch(file_path), specials=["<unk>"], max_words=50000)
Motivation
This allows an nn.Embedding of controllable size, with rare words mapped to <unk>. Otherwise, it is not practical to use build_vocab_from_iterator on larger datasets.
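Until such an argument exists, a common workaround is to count token frequencies yourself, keep only the most frequent entries, and build the Vocab from that truncated counter. The sketch below is illustrative rather than official torchtext API: build_capped_vocab is a hypothetical helper, the 50,000 cap is arbitrary, and yield_tokens_batch / file_path are the names from the example above.

```python
from collections import Counter, OrderedDict

from torchtext.vocab import vocab


def build_capped_vocab(token_iter, max_tokens=50_000, unk_token="<unk>"):
    # Count every token across the dataset.
    counter = Counter()
    for tokens in token_iter:
        counter.update(tokens)

    # Keep only the most frequent entries, leaving one slot for <unk>.
    most_common = counter.most_common(max_tokens - 1)
    v = vocab(OrderedDict(most_common))

    # Insert the special token and map all out-of-vocabulary words to it.
    v.insert_token(unk_token, 0)
    v.set_default_index(v[unk_token])
    return v


# Usage, mirroring the example from the issue:
# capped = build_capped_vocab(yield_tokens_batch(file_path), max_tokens=50_000)
```

Newer torchtext releases have since added a max_tokens argument to build_vocab_from_iterator, so it is worth checking your installed version before rolling your own cap.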
Alternatives
Keras and Huggingface’s tokenizers would be viable alternatives, but they do not integrate nicely with the torchtext ecosystem.
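For comparison, here is a rough sketch of how the Hugging Face tokenizers library caps vocabulary size at training time; the corpus and the 50,000 figure are purely illustrative, and the result is a tokenizers.Tokenizer rather than a torchtext Vocab, which is exactly the integration gap described above.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative corpus; in practice this would be an iterator over the dataset.
corpus = ["the cat sat on the mat", "the dog sat on the log"]

tokenizer = Tokenizer(models.WordLevel(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# vocab_size caps the vocabulary; rarer words fall back to <unk>.
trainer = trainers.WordLevelTrainer(vocab_size=50_000, special_tokens=["<unk>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)
```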
Issue Analytics
- Created: 2 years ago
- Comments: 6 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I wonder if this is valid for all types of tokenizers? For instance, whitespace or other rule/regex-based tokenizers may not provide such constraints out of the box, right?
It’s certainly not valid in all cases, and your point is well taken. I’d at the very least recommend sticking with “max tokens” or “max vocab size”, since “words” is a bit sticky. 😃
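To illustrate the distinction these comments are drawing: the proposed cap would live in the vocabulary-building step, not in the tokenizer itself, so even a plain whitespace tokenizer needs no built-in support for it. The snippet below is a toy illustration with a made-up corpus and an arbitrary cap of 4, counting tokens (not “words”) and truncating by frequency.

```python
from collections import Counter


def whitespace_tokenize(line):
    # Any tokenizer works here: whitespace, regex, subword, ...
    return line.split()


corpus = ["the cat sat on the mat", "the dog sat on the log"]

counter = Counter()
for line in corpus:
    counter.update(whitespace_tokenize(line))

# The cap is applied to the aggregated frequency table afterwards;
# anything that falls outside the top entries would map to <unk>.
kept = dict(counter.most_common(4))
print(kept)  # e.g. {'the': 4, 'sat': 2, 'on': 2, 'cat': 1}
```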