batch_encode_plus() causes OOM, while encode_plus() does not
❓ Questions & Help
Details
I am running a sequence classification task using DistilBertForSequenceClassification. I followed examples/text_classification/run_glue.py and src/transformers/data/processors/glue.py to implement my data loading. My dataset is rather large (~2.5 GB, 7M+ examples) compared to those of the GLUE tasks.
In the current glue.py, _glue_convert_examples_to_features() reads all the examples into a list and then calls batch_encode_plus() on that list. On my large dataset, this implementation caused an out-of-memory (OOM) error. I therefore switched to encode_plus() and called it on each example individually while looping over the dataset; encode_plus() did not cause OOM.
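For reference, here is a minimal sketch of the two call patterns, assuming a DistilBERT tokenizer and a hypothetical iterable `texts` holding the raw strings; the arguments shown are illustrative, not taken from the issue.

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# texts is a hypothetical list/iterable of all raw example strings.

# Batch pattern: builds the feature dicts for every example at once,
# so peak memory grows with the full dataset.
# features = tokenizer.batch_encode_plus(
#     texts, max_length=128, padding="max_length", truncation=True
# )

# Per-example pattern: encode one string at a time and consume the
# result immediately, so peak memory stays close to a single example.
def encode_one_by_one(texts, max_length=128):
    for text in texts:
        yield tokenizer.encode_plus(
            text,
            max_length=max_length,
            padding="max_length",
            truncation=True,
        )
```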
I wonder whether there is something about batch_encode_plus() that prevents it from handling all the examples in a dataset at once. If so, it might be a good idea to add a corresponding note to the documentation.
Issue Analytics
- State:
- Created 3 years ago
- Comments: 5
Top GitHub Comments
I’m also running out of memory using BertTokenizerFast.batch_encode_plus(). I’m using BertTokenizer.batch_encode_plus() now and it seems to be working, but it’s very slow and single-threaded! I have 220 GB of RAM and the dataset is under 2 GB 😞

Any solutions to this? Facing the same issue.
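A possible middle ground, not mentioned in the thread but sketched here under the assumption that the data fits in RAM as raw strings, is to keep the fast tokenizer and feed it fixed-size chunks so that only one chunk’s worth of features is held in memory at a time. encode_in_chunks and chunk_size below are made-up names and values for illustration.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def encode_in_chunks(texts, chunk_size=10_000, max_length=128):
    """Tokenize a large list in fixed-size chunks (hypothetical helper)."""
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        # Only this chunk's features exist at any time; the caller can
        # write them to disk or convert to tensors before the next chunk.
        yield tokenizer.batch_encode_plus(
            chunk,
            max_length=max_length,
            padding="max_length",
            truncation=True,
        )
```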