batch_encode_plus() causes OOM, while encode_plus does not

❓ Questions & Help

Details

I am running a sequence classification task using DistilBertForSequenceClassification. I followed examples/text-classification/run_glue.py and src/transformers/data/processors/glue.py to implement my data loading process. My dataset is rather large (~2.5 GB with 7M+ examples) compared to those of the GLUE tasks.

In the current glue.py, _glue_convert_examples_to_features() reads all the examples into a list and then calls batch_encode_plus() on that list. On my large dataset, this implementation caused an out-of-memory (OOM) error. Therefore, I switched to encode_plus() and called it on each example individually while looping through the dataset. encode_plus() did not cause OOM.
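
A middle ground between the two approaches described above is to encode the dataset in fixed-size chunks, so that only one chunk of raw text and its features is held in memory at a time. The following is a minimal sketch of that idea, not the library's own implementation. The names `chunked`, `encode_in_chunks`, and `toy_encode_batch` are hypothetical helpers introduced for illustration, and the toy encoder merely stands in for a real tokenizer call so that the example runs without any dependencies.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical feature type: one list of ints per example, keyed by field name.
Features = Dict[str, List[List[int]]]


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items without materializing `items`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def encode_in_chunks(
    texts: Iterable[str],
    encode_batch: Callable[[List[str]], Features],
    chunk_size: int = 1000,
) -> Iterator[Features]:
    """Lazily encode `texts` one chunk at a time.

    `encode_batch` is any callable that turns a list of strings into features;
    only one chunk is passed to it per call.
    """
    for chunk in chunked(texts, chunk_size):
        yield encode_batch(chunk)


def toy_encode_batch(batch: List[str]) -> Features:
    """Stand-in for a real tokenizer: maps each whitespace token to its length."""
    return {"input_ids": [[len(tok) for tok in text.split()] for text in batch]}


if __name__ == "__main__":
    # A generator is used as input so the full dataset is never held in a list.
    examples = (f"example number {i}" for i in range(5))
    for features in encode_in_chunks(examples, toy_encode_batch, chunk_size=2):
        print(len(features["input_ids"]), "examples encoded in this chunk")
```

With a real tokenizer, `encode_batch` could wrap `tokenizer.batch_encode_plus`, for instance `lambda batch: tokenizer.batch_encode_plus(batch, truncation=True, max_length=128)`, where the `max_length` value is an assumption and not taken from the issue. Memory stays bounded only if each chunk's features are consumed or written to disk before the next chunk is requested.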

I wonder whether something in batch_encode_plus() prevents it from handling all the examples in a dataset at once. If that is the case, it might be a good idea to add a corresponding note to the documentation.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
Lucianod28 commented, Aug 29, 2020

I’m also running out of memory using BertTokenizerFast.batch_encode_plus(). I’m now using BertTokenizer.batch_encode_plus() and it seems to be working, but it’s very slow and single-threaded! I have 220 GB of RAM and the dataset is under 2 GB 😞

0 reactions
YusufBaig7 commented, Dec 20, 2022

Any solutions to this? I’m facing the same issue.
