batch_encode_plus() causes OOM, while encode_plus() does not
❓ Questions & Help
Details
I am running a sequence classification task using DistilBertForSequenceClassification. I followed examples/text_classification/run_glue.py and src/transformers/data/processors/glue.py to implement my data loading. My dataset is rather large (~2.5 GB, 7M+ examples) compared to those of the GLUE tasks.
In the current glue.py, _glue_convert_examples_to_features() reads all the examples into a list and then calls batch_encode_plus() on that list. On my large dataset, this implementation caused an out-of-memory (OOM) error. I therefore switched to encode_plus() and called it on each example individually while looping over the dataset; encode_plus() did not cause OOM.
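For reference, here is a minimal sketch of the two call patterns, assuming a DistilBERT tokenizer and a hypothetical iterable `texts` holding the raw strings; the arguments shown are illustrative, not taken from the issue.

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# texts is a hypothetical list/iterable of all raw example strings.

# Batch pattern: builds the feature dicts for every example at once,
# so peak memory grows with the full dataset.
# features = tokenizer.batch_encode_plus(
#     texts, max_length=128, padding="max_length", truncation=True
# )

# Per-example pattern: encode one string at a time and consume the
# result immediately, so peak memory stays close to a single example.
def encode_one_by_one(texts, max_length=128):
    for text in texts:
        yield tokenizer.encode_plus(
            text,
            max_length=max_length,
            padding="max_length",
            truncation=True,
        )
```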
I wonder whether there is something about batch_encode_plus() that prevents it from handling all the examples in a dataset at once. If so, it might be a good idea to add a corresponding note to the documentation.
Issue Analytics
- State:
- Created 3 years ago
- Comments: 5
Top GitHub Comments
I’m also running out of memory using BertTokenizerFast.batch_encode_plus(). I’m using BertTokenizer.batch_encode_plus() now and it seems to be working, but it’s very slow and single-threaded! I have 220 GB of RAM and the dataset is under 2 GB 😞

Any solutions to this? Facing the same issue.
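A possible middle ground, not mentioned in the thread but sketched here under the assumption that the data fits in RAM as raw strings, is to keep the fast tokenizer and feed it fixed-size chunks so that only one chunk’s worth of features is held in memory at a time. encode_in_chunks and chunk_size below are made-up names and values for illustration.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def encode_in_chunks(texts, chunk_size=10_000, max_length=128):
    """Tokenize a large list in fixed-size chunks (hypothetical helper)."""
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        # Only this chunk's features exist at any time; the caller can
        # write them to disk or convert to tensors before the next chunk.
        yield tokenizer.batch_encode_plus(
            chunk,
            max_length=max_length,
            padding="max_length",
            truncation=True,
        )
```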