Cache build_vocab; Shared vocabulary


src.build_vocab(mt_train, max_size=80000)
trg.build_vocab(mt_train, max_size=40000)

In the README example, it looks like build_vocab is called twice on the same dataset. For large datasets this could take a while.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

jekbradbury commented, Jul 9, 2017 (1 reaction)

The two calls iterate through entirely separate data: the semantics of field.build_vocab(dataset) are to build the vocab for the field from every column in the provided dataset that is associated with that field.
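The behavior described above can be sketched with a toy model. This is not torchtext's code: ToyField and ToyDataset are hypothetical stand-ins, the real library also adds special tokens such as padding and unknown markers, and the sample sentences are made up. The sketch only illustrates the idea that each build_vocab call counts tokens from the columns bound to that particular field, and that binding one field object to both columns would yield a shared vocabulary.

```python
from collections import Counter


class ToyField:
    """Hypothetical stand-in for a torchtext Field (illustration only)."""

    def __init__(self):
        self.vocab = None

    def build_vocab(self, dataset, max_size=None):
        counter = Counter()
        # Count tokens only from the columns that are bound to this field object.
        for name, field in dataset.fields.items():
            if field is self:
                for example in dataset.examples:
                    counter.update(example[name])
        # Keep the most frequent tokens, capped at max_size (None keeps all).
        self.vocab = [tok for tok, _ in counter.most_common(max_size)]
        return self.vocab


class ToyDataset:
    """Hypothetical dataset: a list of examples plus a column-to-field mapping."""

    def __init__(self, examples, fields):
        self.examples = examples
        self.fields = fields


# Made-up parallel sentences for the sketch.
examples = [
    {"src": ["ein", "haus"], "trg": ["a", "house"]},
    {"src": ["ein", "hund"], "trg": ["a", "dog"]},
]

# Separate fields: each call reads a different column of the same dataset.
src, trg = ToyField(), ToyField()
mt_train = ToyDataset(examples, {"src": src, "trg": trg})
src.build_vocab(mt_train, max_size=80000)
trg.build_vocab(mt_train, max_size=40000)

# Shared vocabulary: one field bound to both columns, built in a single call.
shared = ToyField()
mt_shared = ToyDataset(examples, {"src": shared, "trg": shared})
shared.build_vocab(mt_shared)
```

Under this model the two calls in the README do not repeat work, since src only reads the source column and trg only reads the target column.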

joecummings commented, Sep 7, 2022 (0 reactions)

Closing as stale.


Top Results From Across the Web

torchtext.vocab - PyTorch
Initializes internal Module state, shared by both nn. ... iterator – Iterator used to build Vocab. ... cache – directory for cached vectors....

How to pass new pre-trained embeddings while sharing the ...
I'm afraid the only way is to loop through vectors_imdb keeping only words that are in my vocab, sorting them so that the...

Gensim: share vocabulary across models | Luca Papariello blog
A brief illustration of how to share a common vocabulary among different Gensim models.

torchtext.vocab - Read the Docs
Defines a vocabulary object that will be used to numericalize a field. ... Tensor.zero_; vectors_cache – directory for cached vectors.

Can I use a different corpus for fasttext build_vocab than train ...
The build_vocab() call establishes the known vocabulary of the model, & caches some stats about the corpus. If you then supply another corpus...
