TextHandler Continually Overflows RAM with ~100K Lines of Text
- Contextualized Topic Models version: contextualized-topic-models 1.3.1
- Python version: Python 3.6.10
- Operating System: Ubuntu 18.04
Description
I have a .txt file with ~100,000 lines of text, all of which are between 30 and 128 words long. When I try to initialize TextHandler with this dataset and prepare the resulting handler object, the TextHandler pipeline quickly overflows my 128 GB of RAM.
I’m not sure whether I’m running something incorrectly or whether this method only works with very small datasets. Hopefully it’s not the latter, given that standard topic modeling and other BERT-based sentence-encoding methods run fine on this server with far more data, longer documents (e.g., up to 512 words), and much less RAM.
What I Did
Here is how I initialized the TextHandler and prepared it:
from contextualized_topic_models.utils.data_preparation import TextHandler

handler = TextHandler("/media/seagate0/amazon/data/lang_pol_data_cult_bks_sampled_text.txt")
handler.prepare()
Halfway through the preparation, Python threw the warning below, after which it overflowed the RAM and crashed:
/home/amruch/anaconda3/envs/nlp_polarization/lib/python3.6/site-packages/contextualized_topic_models/utils/data_preparation.py:62: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
self.vocab_dict[x], y.split()))), data)))
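For context, this warning comes from NumPy itself, not from anything specific to the input file: it fires whenever an ndarray is built from nested sequences of unequal length, which is exactly what a list of tokenized documents of different lengths is. A minimal reproduction, in plain NumPy and purely illustrative:

import numpy as np

# Documents tokenize to lists of different lengths, so the nested
# sequence is "ragged". On NumPy >= 1.19 this emits the
# VisibleDeprecationWarning above unless dtype=object is specified.
ragged = [[1, 2, 3], [4, 5]]
arr = np.array(ragged, dtype=object)  # explicit dtype=object avoids the warning
print(arr.shape)  # (2,) -- a 1-D array holding Python lists, not a 2-D matrix

The warning itself is harmless; the memory blow-up presumably comes from materializing the full bag-of-words representation over a large vocabulary, which, per the maintainer’s comment below, newer versions of the package handle better.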
Many thanks for any suggestions! Very much looking forward to applying this method!
Comments
Wow-wow-wowwww, yeah, not only did that fix the problem, but it also made the TextHandler prepare() method run about 100 times faster. Thanks for the suggestion on that. I usually avoid upgrading packages in my environments unless there is a major development, and that was certainly a major development you all made. Thank you so much!

Thanks for the other suggestions as well. I’ll try those too!
I’ve just noticed that you are using version 1.3.1 of the package. From version 1.4.1 onwards we are able to handle datasets with larger vocabularies, so updating the package will probably fix your issue 😃

As explained in #7, the BOW representation is necessary for our model, and it allows us to extract the most relevant words for each topic. I suggest you preprocess the input text for the BOW: it will reduce the computational cost, and discarding the less frequent words may also lead to better results.
Waiting for your feedback!
Silvia
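As a hedged illustration of the preprocessing Silvia suggests above, a vocabulary can be capped and rare words dropped before handing the text to TextHandler. This sketch uses scikit-learn rather than contextualized-topic-models; the file names and thresholds are placeholders, not part of the package:

from sklearn.feature_extraction.text import CountVectorizer

# Build a capped vocabulary: keep at most 2,000 terms, drop words seen in
# fewer than 5 documents, and remove English stop words. All thresholds
# are illustrative; tune them for your corpus.
with open("documents.txt") as f:
    docs = [line.strip() for line in f if line.strip()]

vectorizer = CountVectorizer(max_features=2000, min_df=5, stop_words="english")
vectorizer.fit(docs)
vocab = set(vectorizer.get_feature_names_out())  # get_feature_names() on sklearn < 1.0

# Rewrite each document keeping only in-vocabulary tokens; the resulting
# file is what would be passed to TextHandler as the BOW input.
with open("documents_bow.txt", "w") as f:
    for doc in docs:
        f.write(" ".join(w for w in doc.split() if w.lower() in vocab) + "\n")

Capping the vocabulary this way bounds the size of the bag-of-words representation regardless of corpus size, which is what keeps memory use in check.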