TextHandler Continually Overflows RAM with ~100K Lines of Text
- Contextualized Topic Models version: contextualized-topic-models 1.3.1
- Python version: Python 3.6.10
- Operating System: Ubuntu 18.04
Description
I have a .txt file with ~100,000 lines of text, all of which are between 30 and 128 words long. When I try to initialize TextHandler with this dataset and prepare the resulting handler object, the TextHandler pipeline quickly overflows my 128 GB of RAM.
I’m not sure whether I’m running something incorrectly or whether this method only works with very small datasets. Hopefully it’s not the latter, given that standard topic modeling and other BERT-based sentence-encoding methods run fine on this server with far more data, longer documents (e.g., up to 512 words), and much less RAM.
What I Did
Here is how I initialized the TextHandler and prepared it:
from contextualized_topic_models.utils.data_preparation import TextHandler

handler = TextHandler("/media/seagate0/amazon/data/lang_pol_data_cult_bks_sampled_text.txt")
handler.prepare()
Halfway through the preparation, Python threw the warning below, after which it overflowed the RAM and crashed:
/home/amruch/anaconda3/envs/nlp_polarization/lib/python3.6/site-packages/contextualized_topic_models/utils/data_preparation.py:62: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
self.vocab_dict[x], y.split()))), data)))
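For context, this warning comes from NumPy itself, not from anything specific to the input file: it fires whenever an ndarray is built from nested sequences of unequal length, which is exactly what a list of tokenized documents of different lengths is. A minimal reproduction, in plain NumPy and purely illustrative:

import numpy as np

# Documents tokenize to lists of different lengths, so the nested
# sequence is "ragged". On NumPy >= 1.19 this emits the
# VisibleDeprecationWarning above unless dtype=object is specified.
ragged = [[1, 2, 3], [4, 5]]
arr = np.array(ragged, dtype=object)  # explicit dtype=object avoids the warning
print(arr.shape)  # (2,) -- a 1-D array holding Python lists, not a 2-D matrix

The warning itself is harmless; the memory blow-up presumably comes from materializing the full bag-of-words representation over a large vocabulary, which, per the maintainer’s comment below, newer versions of the package handle better.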
Many thanks for any suggestions! Very much looking forward to applying this method!
Comments
Wow-wow-wowwww, yeah, not only did that fix the problem, but it also made the TextHandler prepare() method run about 100 times faster. Thanks for the suggestion on that. I usually avoid upgrading packages in my environments unless there is a major development, and that was certainly a major development you all made. Thank you so much!

Thanks for the other suggestions as well. I’ll try those too!
I’ve just noticed that you are using version 1.3.1 of the package. From version 1.4.1 onwards we are able to handle datasets with larger vocabularies, so updating the package will probably fix your issue 😃

As explained in #7, the BOW representation is necessary for our model, and it allows us to extract the most relevant words for each topic. I suggest you preprocess the input text for the BOW: it will reduce the computational cost, and discarding the less frequent words may also lead to better results.
Waiting for your feedback!
Silvia
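As a hedged illustration of the preprocessing Silvia suggests above, a vocabulary can be capped and rare words dropped before handing the text to TextHandler. This sketch uses scikit-learn rather than contextualized-topic-models; the file names and thresholds are placeholders, not part of the package:

from sklearn.feature_extraction.text import CountVectorizer

# Build a capped vocabulary: keep at most 2,000 terms, drop words seen in
# fewer than 5 documents, and remove English stop words. All thresholds
# are illustrative; tune them for your corpus.
with open("documents.txt") as f:
    docs = [line.strip() for line in f if line.strip()]

vectorizer = CountVectorizer(max_features=2000, min_df=5, stop_words="english")
vectorizer.fit(docs)
vocab = set(vectorizer.get_feature_names_out())  # get_feature_names() on sklearn < 1.0

# Rewrite each document keeping only in-vocabulary tokens; the resulting
# file is what would be passed to TextHandler as the BOW input.
with open("documents_bow.txt", "w") as f:
    for doc in docs:
        f.write(" ".join(w for w in doc.split() if w.lower() in vocab) + "\n")

Capping the vocabulary this way bounds the size of the bag-of-words representation regardless of corpus size, which is what keeps memory use in check.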