Is a dense-representation BoW necessary?
I understand why the BoW matrix is necessary for the combined model. Is it also necessary for the contextual model? The BoW matrix requires num_documents * vocabulary_size memory but is mostly zeroes. Is it possible to modify the contextual model pipeline so that the BoW matrix does not have to be created? Alternatively, can the BoW be stored in a more efficient sparse representation?
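To make the point concrete, here is a small sketch (using scikit-learn's CountVectorizer, which already returns a scipy CSR matrix; the toy documents are just for illustration). The sparse matrix itself is cheap; it is the conversion to dense that blows up memory:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog ran", "the cat ran"]
bow = CountVectorizer().fit_transform(docs)   # scipy.sparse csr_matrix
print(type(bow), bow.shape, bow.nnz)          # only the nonzero entries are stored
dense = bow.toarray()                         # this step costs num_documents * vocabulary_size
```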
A very large instance with ~600GB of memory can only hold a dense BoW for ~5 million documents with a vocabulary of just 15,000 tokens (at 8 bytes per float64 entry). That may sound like a lot, but that limit is easily reached if large documents are broken up into smaller sections to fit within BERT's input length limit. And 15,000 tokens is a pretty small vocabulary, especially for a cased model.
For example, we would like to try this method with a vocabulary of 100,000 tokens and ~100M documents, representing about 5GB of text. A dense BoW matrix for that would have 10^13 entries, i.e. tens of terabytes of memory even at a few bytes per entry. A sparse representation would require far less, since each document touches only a small fraction of the vocabulary.
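The back-of-the-envelope arithmetic for the 5M-document / 15k-vocabulary case above (our assumption: float64 entries and roughly 100 distinct tokens per short document section):

```python
# Rough estimate, not a measurement of the actual pipeline.
num_docs = 5_000_000
vocab_size = 15_000
avg_nonzeros = 100          # assumption: ~100 distinct tokens per (short) document

dense_bytes = num_docs * vocab_size * 8          # float64 entries
# CSR keeps one (float64 value, int32 column index) pair per nonzero,
# plus one int32 row pointer per row.
nnz = num_docs * avg_nonzeros
csr_bytes = nnz * (8 + 4) + (num_docs + 1) * 4

print(f"dense: {dense_bytes / 1e9:,.0f} GB")     # 600 GB
print(f"CSR:   {csr_bytes / 1e9:,.1f} GB")       # ~6 GB under these assumptions
```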
Hi!
The BoW matrix is also necessary for the contextual model, because the decoder network has to reconstruct the BoW. Regarding a sparse representation of the BoW matrix: we just created the `develop` branch, where the BoW is stored as a sparse matrix. Unfortunately, we are not able to automatically test these changes right now, but the few manual tests we ran have given positive results. Feedback or a contribution to this new feature would be really appreciated! 😃
Thanks a lot,
Silvia
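To illustrate why sparse storage is compatible with a decoder that reconstructs a dense target, here is a minimal sketch (assumptions: scipy and PyTorch; `SparseBowDataset` is a hypothetical helper, not the actual `develop`-branch code). The corpus-level BoW stays sparse and only one mini-batch is densified at a time:

```python
import numpy as np
import torch
from scipy import sparse
from torch.utils.data import Dataset, DataLoader

class SparseBowDataset(Dataset):
    """Wraps a scipy CSR BoW matrix; rows are densified lazily per item."""
    def __init__(self, bow_csr: sparse.csr_matrix):
        self.bow = bow_csr

    def __len__(self):
        return self.bow.shape[0]

    def __getitem__(self, idx):
        # Only this one row becomes dense: vocab_size floats, not the full matrix.
        row = self.bow[idx].toarray().squeeze(0).astype(np.float32)
        return torch.from_numpy(row)

# Usage: a tiny random sparse BoW stands in for a real corpus.
bow = sparse.random(1000, 15_000, density=0.005, format="csr")
loader = DataLoader(SparseBowDataset(bow), batch_size=64, shuffle=True)
for batch in loader:
    print(batch.shape)  # torch.Size([64, 15000]) -- dense, but only per batch
    break
```

Densifying per batch bounds peak memory at batch_size * vocab_size floats instead of num_documents * vocab_size.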
Thanks so much! Really appreciate it! Sorry for the double post!