Is a dense-representation BoW necessary?
I understand why the BoW matrix is necessary for the combined model. Is it also necessary for the contextual model? The BoW matrix requires num_documents * vocabulary_size memory but is mostly zeroes. Is it possible to modify the contextual model pipeline so that the BoW matrix does not have to be created? Alternatively, can the BoW be stored in a more efficient sparse representation?
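To make the point concrete, here is a small sketch (using scikit-learn's CountVectorizer, which already returns a scipy CSR matrix; the toy documents are just for illustration). The sparse matrix itself is cheap; it is the conversion to dense that blows up memory:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog ran", "the cat ran"]
bow = CountVectorizer().fit_transform(docs)   # scipy.sparse csr_matrix
print(type(bow), bow.shape, bow.nnz)          # only the nonzero entries are stored
dense = bow.toarray()                         # this step costs num_documents * vocabulary_size
```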
A very large instance with ~600GB of memory can only hold a dense BoW for ~5 million documents with a vocabulary of just 15,000 tokens (at 8 bytes per float64 entry). That may sound like a lot, but that limit is easily reached if large documents are broken up into smaller sections to fit within BERT's input length limit. And 15,000 tokens is a pretty small vocabulary, especially for a cased model.
For example, we would like to try this method with a vocabulary of 100,000 tokens and ~100M documents, representing about 5GB of text. A dense BoW matrix for that would have 10^13 entries, i.e. tens of terabytes of memory even at a few bytes per entry. A sparse representation would require far less, since each document touches only a small fraction of the vocabulary.
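The back-of-the-envelope arithmetic for the 5M-document / 15k-vocabulary case above (our assumption: float64 entries and roughly 100 distinct tokens per short document section):

```python
# Rough estimate, not a measurement of the actual pipeline.
num_docs = 5_000_000
vocab_size = 15_000
avg_nonzeros = 100          # assumption: ~100 distinct tokens per (short) document

dense_bytes = num_docs * vocab_size * 8          # float64 entries
# CSR keeps one (float64 value, int32 column index) pair per nonzero,
# plus one int32 row pointer per row.
nnz = num_docs * avg_nonzeros
csr_bytes = nnz * (8 + 4) + (num_docs + 1) * 4

print(f"dense: {dense_bytes / 1e9:,.0f} GB")     # 600 GB
print(f"CSR:   {csr_bytes / 1e9:,.1f} GB")       # ~6 GB under these assumptions
```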
Hi!
The BoW matrix is also necessary for the contextual model, because the decoder network has to reconstruct the BoW. Regarding a sparse representation of the BoW matrix: we just created the `develop` branch, where the BoW is stored as a sparse matrix. Unfortunately, we are not able to automatically test these changes right now, but the few manual tests we ran have given positive results. Feedback or a contribution to this new feature would be really appreciated! 😃
Thanks a lot,
Silvia
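To illustrate why sparse storage is compatible with a decoder that reconstructs a dense target, here is a minimal sketch (assumptions: scipy and PyTorch; `SparseBowDataset` is a hypothetical helper, not the actual `develop`-branch code). The corpus-level BoW stays sparse and only one mini-batch is densified at a time:

```python
import numpy as np
import torch
from scipy import sparse
from torch.utils.data import Dataset, DataLoader

class SparseBowDataset(Dataset):
    """Wraps a scipy CSR BoW matrix; rows are densified lazily per item."""
    def __init__(self, bow_csr: sparse.csr_matrix):
        self.bow = bow_csr

    def __len__(self):
        return self.bow.shape[0]

    def __getitem__(self, idx):
        # Only this one row becomes dense: vocab_size floats, not the full matrix.
        row = self.bow[idx].toarray().squeeze(0).astype(np.float32)
        return torch.from_numpy(row)

# Usage: a tiny random sparse BoW stands in for a real corpus.
bow = sparse.random(1000, 15_000, density=0.005, format="csr")
loader = DataLoader(SparseBowDataset(bow), batch_size=64, shuffle=True)
for batch in loader:
    print(batch.shape)  # torch.Size([64, 15000]) -- dense, but only per batch
    break
```

Densifying per batch bounds peak memory at batch_size * vocab_size floats instead of num_documents * vocab_size.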
Thanks so much! Really appreciate it! Sorry for the double post!