
TextHandler Continually Overflows RAM with ~100K Lines of Text

See original GitHub issue
  • Contextualized Topic Models version: contextualized-topic-models 1.3.1
  • Python version: Python 3.6.10
  • Operating System: Ubuntu 18.04

Description


I have a .txt file with ~100,000 lines of text, all of which are between 30 and 128 words long. When I try to initialize TextHandler with this dataset and prepare the resulting handler object, the TextHandler pipeline quickly overflows my 128 GB of RAM.

I’m not sure if I’m running something incorrectly or if this method can only work with very small datasets. Hopefully it’s not the latter, given that standard topic modeling and other BERT-based sentence encoding methods work fine on my server with far more data and longer documents (e.g., up to 512 words) – and with much less RAM.

What I Did

Here is how I initialized the TextHandler and prepared it:

handler = TextHandler("/media/seagate0/amazon/data/lang_pol_data_cult_bks_sampled_text.txt")
handler.prepare()

Halfway through the preparation, Python threw this warning, after which it overflowed the RAM and crashed:

/home/amruch/anaconda3/envs/nlp_polarization/lib/python3.6/site-packages/contextualized_topic_models/utils/data_preparation.py:62: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  self.vocab_dict[x], y.split()))), data)))
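For context, the warning above comes from NumPy deprecating the creation of arrays from ragged nested sequences (rows of unequal length, as tokenized documents naturally are). A minimal standalone sketch of the behavior – not code from contextualized-topic-models itself:

```python
import numpy as np

# Tokenized documents have different lengths, so the nested lists are "ragged".
ragged = [[1, 2, 3], [4, 5]]

# On recent NumPy versions, np.array(ragged) without a dtype either emits
# VisibleDeprecationWarning or raises an error, depending on the version.
# Passing dtype=object keeps each row as a Python list inside a 1-D array:
arr = np.array(ragged, dtype=object)

print(arr.shape)  # (2,)
print(arr.dtype)  # object
```

The warning itself is not what exhausts memory here, but it points at the same code path that materializes the full dataset at once.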

Many thanks for any suggestions! Very much looking forward to applying this method!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
AlexMRuch commented, Aug 16, 2020

Wow-wow-wowwww, yeah, not only did that fix the problem but it also made the TextHandler prepare method run about 100 times faster. Thanks for the suggestion on that. I usually avoid upgrading packages in my environments unless there is a major development, and that was certainly a major development you all made. Thank you so much!

Thanks for the other suggestions as well. I’ll try those too!

0 reactions
silviatti commented, Aug 16, 2020

I’ve just noticed that you are using version 1.3.1 of the package. From version 1.4.1 we are able to handle datasets with larger vocabularies. Updating your package will probably fix your issue 😃

As explained in #7, the BOW representation is necessary for our model and it allows us to extract the most relevant words for each topic. I suggest you preprocess the input text for the BOW: it will reduce the computational cost, and discarding the less frequent words may also lead to better results.

Waiting for your feedback!

Silvia

