Add a `max_words` argument to `build_vocab_from_iterator`
🚀 Feature
I believe it would be beneficial to be able to limit the number of words in the vocabulary with an argument like max_words, e.g.:
vocab = build_vocab_from_iterator(yield_tokens_batch(file_path), specials=["<unk>"], max_words=50000)
Motivation
This allows an nn.Embedding of controllable size, with rare words mapped to <unk>. Otherwise, it is not practical to use build_vocab_from_iterator on larger datasets.
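Until such an argument exists, a common workaround is to count token frequencies yourself, keep only the most frequent entries, and build the Vocab from that truncated counter. The sketch below is illustrative rather than official torchtext API: build_capped_vocab is a hypothetical helper, the 50,000 cap is arbitrary, and yield_tokens_batch / file_path are the names from the example above.

```python
from collections import Counter, OrderedDict

from torchtext.vocab import vocab


def build_capped_vocab(token_iter, max_tokens=50_000, unk_token="<unk>"):
    # Count every token across the dataset.
    counter = Counter()
    for tokens in token_iter:
        counter.update(tokens)

    # Keep only the most frequent entries, leaving one slot for <unk>.
    most_common = counter.most_common(max_tokens - 1)
    v = vocab(OrderedDict(most_common))

    # Insert the special token and map all out-of-vocabulary words to it.
    v.insert_token(unk_token, 0)
    v.set_default_index(v[unk_token])
    return v


# Usage, mirroring the example from the issue:
# capped = build_capped_vocab(yield_tokens_batch(file_path), max_tokens=50_000)
```

Newer torchtext releases have since added a max_tokens argument to build_vocab_from_iterator, so it is worth checking your installed version before rolling your own cap.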
Alternatives
Keras and Huggingface’s tokenizers would be viable alternatives, but they do not integrate nicely with the torchtext ecosystem.
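For comparison, here is a rough sketch of how the Hugging Face tokenizers library caps vocabulary size at training time; the corpus and the 50,000 figure are purely illustrative, and the result is a tokenizers.Tokenizer rather than a torchtext Vocab, which is exactly the integration gap described above.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative corpus; in practice this would be an iterator over the dataset.
corpus = ["the cat sat on the mat", "the dog sat on the log"]

tokenizer = Tokenizer(models.WordLevel(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# vocab_size caps the vocabulary; rarer words fall back to <unk>.
trainer = trainers.WordLevelTrainer(vocab_size=50_000, special_tokens=["<unk>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)
```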
Issue Analytics
- Created: 2 years ago
- Comments: 6 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I wonder if this is valid for all types of tokenizers? For instance, whitespace or other rule/regex-based tokenizers may not provide such constraints out of the box, right?
It’s certainly not valid in all cases, and your point is well taken. I’d at the very least recommend sticking with “max tokens” or “max vocab size”, since “words” is a bit sticky. 😃
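To illustrate the distinction these comments are drawing: the proposed cap would live in the vocabulary-building step, not in the tokenizer itself, so even a plain whitespace tokenizer needs no built-in support for it. The snippet below is a toy illustration with a made-up corpus and an arbitrary cap of 4, counting tokens (not “words”) and truncating by frequency.

```python
from collections import Counter


def whitespace_tokenize(line):
    # Any tokenizer works here: whitespace, regex, subword, ...
    return line.split()


corpus = ["the cat sat on the mat", "the dog sat on the log"]

counter = Counter()
for line in corpus:
    counter.update(whitespace_tokenize(line))

# The cap is applied to the aggregated frequency table afterwards;
# anything that falls outside the top entries would map to <unk>.
kept = dict(counter.most_common(4))
print(kept)  # e.g. {'the': 4, 'sat': 2, 'on': 2, 'cat': 1}
```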