
Add a `max_words` argument to `build_vocab_from_iterator`

See original GitHub issue

🚀 Feature

Link to the docs

I believe it would be beneficial to be able to limit the number of words in the vocabulary with an argument like `max_words`, e.g.:

vocab = build_vocab_from_iterator(yield_tokens_batch(file_path), specials=["<unk>"], max_words=50000)

Motivation

This allows an `nn.Embedding` of controllable size, with rare words mapped to `<unk>`. Otherwise, `build_vocab_from_iterator` would not be practical for larger datasets.
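As a workaround until such an argument exists, the cap can be emulated with `collections.Counter` and the `torchtext.vocab.vocab` factory. A minimal sketch, where `build_capped_vocab`, `max_words`, and `unk_token` are illustrative names rather than torchtext API:

```python
from collections import Counter, OrderedDict

from torchtext.vocab import vocab

def build_capped_vocab(token_iter, max_words=50000, unk_token="<unk>"):
    # Count token frequencies across the whole corpus.
    counter = Counter()
    for tokens in token_iter:
        counter.update(tokens)
    # Keep only the max_words most frequent token types.
    ordered = OrderedDict(counter.most_common(max_words))
    v = vocab(ordered)
    # Reserve index 0 for <unk> and route all out-of-vocabulary
    # lookups to it.
    v.insert_token(unk_token, 0)
    v.set_default_index(v[unk_token])
    return v
```

Dropping rare types this way keeps the resulting `nn.Embedding` at a fixed `max_words + 1` rows regardless of corpus size.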

Alternatives

Keras and Hugging Face tokenizers would be viable alternatives, but they do not integrate nicely with the torchtext ecosystem.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
erip commented, Jan 16, 2022

It’s certainly not valid in all cases and your point is well taken. I’d at the very least recommend sticking with “max tokens” or “max vocab size” since words are a bit sticky. 😃

1 reaction
parmeet commented, Jan 16, 2022

> I’ll add that normally your “upstream” tokenization strategy will handle this for you.

I wonder if this is valid for all types of tokenizers? For instance white-space or other rule/regex based tokenizers may not provide such constraints out-of-the-box, right?
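For what it’s worth, this is how things ultimately landed: newer torchtext releases (0.12 and later) expose the cap as a `max_tokens` argument on `build_vocab_from_iterator`, matching the “max tokens” naming suggested above. A usage sketch, where `corpus.txt` and the whitespace tokenizer are assumptions for illustration:

```python
from torchtext.vocab import build_vocab_from_iterator

def yield_tokens(path):
    # A plain whitespace tokenizer: it imposes no limit on the
    # number of distinct token types, so the cap has to be applied
    # when the vocabulary is built.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.split()

vocab = build_vocab_from_iterator(
    yield_tokens("corpus.txt"),  # assumed corpus location
    specials=["<unk>"],
    max_tokens=50000,  # total vocabulary cap, specials included
)
vocab.set_default_index(vocab["<unk>"])  # out-of-vocabulary -> <unk>
```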


Top Results From Across the Web

torchtext.vocab - PyTorch
iterator – Iterator used to build Vocab. Must yield list or iterator of tokens. min_freq – The minimum frequency needed to include a...

torchtext.vocab - Read the Docs
Counter object holding the frequencies of each word found in the data. max_size – The maximum size of the subword vocabulary, or None...

Unable to build vocab for a torchtext text classification
This means that in its current state, because strings can be iterated upon, your code will create a vocab able to encode all...

Pytorch build_vocab_from_iterator giving vocabulary with very ...
I am trying to build a translation model in pytorch. ... but there were very few words in the vocabulary ( len(en_vocab) ->...

[RFC] Special symbols in torchtext.experimental.Vocab #1016
This issue discusses the special symbols used in the experimental Vocabulary class. Here is a quick search among the vocabulary classes in ...
