
Is a dense-representation BoW necessary?

See original GitHub issue

I understand why the BoW matrix is necessary for the combined model. Is it also necessary for the contextual model? The BoW matrix requires num_documents * vocabulary_size memory but is mostly zeroes. Is it possible to modify the contextual model pipeline so that the BoW matrix does not have to be created? Alternatively, can the BoW be stored in a more efficient sparse representation?

A very large instance with ~600GB of memory can only handle ~5 million documents with a vocabulary of just 15,000 tokens. That may sound like a lot, but the limit is easily reached once large documents are split into smaller sections to fit within BERT's input-length limit. And 15,000 tokens is a fairly small vocabulary, especially for a cased model.
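For concreteness, here is the back-of-the-envelope arithmetic behind that figure, assuming the dense matrix holds 8-byte float64 entries (NumPy's default dtype):

```python
# Dense BoW memory estimate, assuming float64 (8 bytes per entry).
docs, vocab = 5_000_000, 15_000
dense_bytes = docs * vocab * 8
print(dense_bytes / 1e9, "GB")  # 600.0 -> roughly fills the whole 600GB instance
```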

For example, we would like to try this method with a vocabulary of 100,000 tokens and ~100M documents, representing about 5GB of text. A dense BoW matrix for that corpus would require on the order of 10TB of memory even at a single byte per entry, while a sparse representation would need only a small fraction of that, since almost every entry is zero.
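As a minimal sketch of the sparse alternative: scikit-learn's CountVectorizer already produces a scipy CSR matrix, whose memory scales with the number of nonzero counts rather than with num_documents * vocabulary_size. The toy corpus below is purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog ate my homework"]  # toy corpus
bow = CountVectorizer(max_features=100_000).fit_transform(corpus)

# CSR stores only the nonzero counts plus two index arrays, so memory
# scales with the number of nonzeros, not with docs * vocab.
csr_bytes = bow.data.nbytes + bow.indices.nbytes + bow.indptr.nbytes
print(bow.shape, bow.nnz, "nonzeros,", csr_bytes, "bytes in CSR form")

dense = bow.toarray()  # densifying is the step that blows up at scale
print(dense.nbytes, "bytes dense")
```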

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6

Top GitHub Comments

3 reactions
silviatti commented, Jul 30, 2020

Hi!

The BoW matrix is also necessary for the contextual model, because the decoder network has to reconstruct the BoW. Regarding a sparse representation of the BoW matrix: we just created the develop branch, where the BoW is stored as a sparse matrix. Unfortunately, we are not able to automatically test these changes right now, but the few manual tests we have run gave positive results.
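For readers following along, one standard way to reconcile a sparse corpus-level BoW with a decoder that needs dense inputs (sketched below; not necessarily what the develop branch does, and the dense_bow_batches helper is hypothetical) is to densify only one mini-batch at a time, since the reconstruction loss only ever needs the rows of the current batch:

```python
import numpy as np
import scipy.sparse as sp
import torch

def dense_bow_batches(bow: sp.csr_matrix, batch_size: int):
    """Yield dense float32 tensors one mini-batch at a time, so at most
    batch_size * vocab_size values are ever materialized in memory."""
    for start in range(0, bow.shape[0], batch_size):
        rows = bow[start:start + batch_size]  # row slice stays sparse
        yield torch.from_numpy(rows.toarray().astype(np.float32))
```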

Feedback or a contribution to this new feature would be really appreciated! 😃

Thanks a lot,

Silvia

0 reactions
AlexMRuch commented, Aug 16, 2020

Thanks so much! Really appreciate it! Sorry for the double post!

Read more comments on GitHub >

