Cache build_vocab; Shared vocabulary
```python
src.build_vocab(mt_train, max_size=80000)
trg.build_vocab(mt_train, max_size=40000)
```

In the README example, it looks like build_vocab is called twice on the same dataset. For large datasets this could take a while.
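One practical workaround (not something torchtext provides out of the box) is to cache the built vocab to disk and reload it on later runs. A minimal sketch, assuming the legacy Field API and that the Vocab object is picklable in your torchtext version; the cache path is hypothetical:

```python
import os
import pickle

VOCAB_CACHE = "src_vocab.pkl"  # hypothetical cache location

if os.path.exists(VOCAB_CACHE):
    # Reuse the previously built vocabulary instead of re-counting tokens.
    with open(VOCAB_CACHE, "rb") as f:
        src.vocab = pickle.load(f)
else:
    # First run: build the vocab, then cache it for next time.
    src.build_vocab(mt_train, max_size=80000)
    with open(VOCAB_CACHE, "wb") as f:
        pickle.dump(src.vocab, f)
```

The same pattern applies to `trg` with its own cache file.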
Issue Analytics

- State: closed
- Created 6 years ago
- Comments: 8 (8 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The two calls iterate through entirely separate data: the semantics of

```python
field.build_vocab(dataset)
```

are to build the vocab for the field from every column in the provided dataset that is associated with that field. So `src.build_vocab(mt_train)` only iterates over the source columns of `mt_train`, and `trg.build_vocab(mt_train)` only over the target columns; no data is read twice.

Closing as stale.