trainer.evaluate() aggregates predictions on GPU and causes CUDA out of memory issues for large datasets
Environment info
- `transformers` version: 3.1.0
- Platform: Linux-4.15.0-112-generic-x86_64-with-glibc2.10
- Python version: 3.8.5
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
Trainer: @sgugger
Information
Model I am using (Bert, XLNet …): Bert
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Use the Trainer for evaluation (`.evaluate()`, `.predict()`) on the GPU with BERT and a large evaluation dataset, where the size of the returned prediction tensors plus the model exceeds GPU RAM. (In my case I had an evaluation dataset of 469,530 sentences.)
- Trainer will crash with a CUDA out-of-memory exception.
Expected behavior
- I would expect the predictions from each step of `predict()` or `evaluate()` to be moved to the CPU (off the GPU) and concatenated there. Instead, the tensors are concatenated while still on the GPU device, and only converted to CPU numpy arrays after the whole dataset has been predicted/evaluated.
This means that for large evaluation datasets you’ll run out of CUDA memory.
It also makes it difficult to pick a batch size that makes good use of the GPU, as you need to leave room not only for the model and inputs but also for all of the accumulated predictions, which add up when dealing with large evaluation datasets.
- i.e. the problem in the trainer code is that the predictions stay on the GPU here
- These tensors get concatenated but stay on the GPU here
- and then the predictions eventually end up on the CPU here
This means that for larger evaluation datasets all the predictions stay on the GPU, so the memory they consume grows with the length of the evaluation dataset, and you often run out of GPU RAM.
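For illustration only, here is a minimal sketch of that idea (this is not the actual Trainer code; `model`, `eval_dataset` and `collate_fn` are placeholders, and the batches are assumed to be dicts of tensors with a "labels" key). Each batch's logits are moved to the CPU as soon as they are computed, so the GPU only ever has to hold the model and one batch:

```python
import torch
from torch.utils.data import DataLoader

def predict_off_gpu(model, eval_dataset, collate_fn=None, batch_size=32, device="cuda"):
    """Run prediction batch by batch, moving each batch's logits to the CPU
    immediately so accumulated predictions never occupy GPU memory."""
    model.to(device)
    model.eval()
    loader = DataLoader(eval_dataset, batch_size=batch_size, collate_fn=collate_fn)
    all_logits, all_labels = [], []
    with torch.no_grad():
        for batch in loader:
            labels = batch.pop("labels")              # assumes dict-style batches with a "labels" key
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            logits = outputs[0]                       # logits are the first element when no labels are passed
            all_logits.append(logits.detach().cpu())  # move off the GPU right away
            all_labels.append(labels)                 # labels never leave the CPU
    return torch.cat(all_logits), torch.cat(all_labels)
```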
In the meantime, a workaround is to do something like the following and run the loop yourself, predicting chunk by chunk:
```python
import numpy as np
from tqdm import tqdm

step = 10_000  # example chunk size; pick it so one chunk of predictions fits in GPU memory
preds = []
for i in tqdm(range(0, len(df_valid), step)):
    # KeywordDataset, df_valid, targets and tokenizer come from my own code
    ds_chunk = KeywordDataset(tokenizer, df_valid[i:i + step], targets)
    chunk_preds = trainer.predict(ds_chunk)
    preds.append(chunk_preds)
    chunk_accuracy = (chunk_preds.predictions.argmax(-1) == chunk_preds.label_ids).mean()

np_preds = np.concatenate([pred.predictions for pred in preds], axis=0)
np_label_ids = np.concatenate([pred.label_ids for pred in preds], axis=0)
acc = (np_preds.argmax(-1) == np_label_ids).mean()
print('eval_accuracy = ', acc)
```
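Since `trainer.predict()` is called on one chunk at a time, only that chunk's predictions are ever accumulated on the GPU, and the `PredictionOutput` it returns already holds numpy arrays on the CPU, so concatenating the chunks afterwards with numpy costs no GPU memory.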
It would be nice if the aggregation defaulted to the CPU (so you don't have to worry about it), with an additional Trainer argument specifying the device on which to aggregate predictions, so that the behaviour can be overridden if it is important for certain use cases.
Thanks for the amazing work on an amazing library. Working with transformers has never been easier due to the hard work of your team!
Issue Analytics
- Created: 3 years ago
- Reactions: 5
- Comments: 8 (4 by maintainers)
In case that solution is not suitable for someone: like the author, I had the same problem, an out-of-memory error using CUDA, but my dataset was small enough that I could stay on the CPU for training. So I wrote my own training function, converted the dataset to a PyTorch DataLoader, and that was it; this way I can fully control which device is used 😃.

There is a workaround (chunking your dataset into smaller parts), which is why we didn't implement a quick fix. I'll start working on the right fix next week and link the PR here once it's ready @usuyama, so you can see the corresponding part in datasets (I have not yet investigated the exact part in datasets, so I can't answer your question right now; I've just been told it should be possible 😉).
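For anyone who wants to follow the hand-rolled route described in the first comment above, a rough sketch of a CPU-only evaluation loop might look like this (purely illustrative; `evaluate_on_cpu` and the assumption of dict-style batches with a "labels" key are mine, not part of the library):

```python
import torch
from torch.utils.data import DataLoader

def evaluate_on_cpu(model, eval_dataset, batch_size=16):
    """Evaluate entirely on the CPU so GPU memory is never a concern."""
    model.to("cpu")
    model.eval()
    loader = DataLoader(eval_dataset, batch_size=batch_size)
    correct = total = 0
    with torch.no_grad():
        for batch in loader:
            labels = batch.pop("labels")
            logits = model(**batch)[0]  # first output element is the logits
            correct += (logits.argmax(-1) == labels).sum().item()
            total += labels.size(0)
    return correct / total
```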