trainer.evaluate() aggregates predictions on GPU and causes CUDA out of memory issues for large datasets
Environment info
- `transformers` version: 3.1.0
- Platform: Linux-4.15.0-112-generic-x86_64-with-glibc2.10
- Python version: 3.8.5
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
Trainer: @sgugger
Information
Model I am using (Bert, XLNet …): Bert
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Use the Trainer for evaluation (`.evaluate()`, `.predict()`) on the GPU with BERT and a large evaluation dataset, where the size of the returned prediction tensors plus the model exceeds GPU RAM. (In my case I had an evaluation dataset of 469,530 sentences.)
- Trainer will crash with a CUDA out-of-memory exception.
Expected behavior
- I would expect the predictions from each step of `predict()` or `evaluate()` to be moved to the CPU (off the GPU) and concatenated there. Instead, the tensors are concatenated while still on the GPU device, and only converted to CPU numpy arrays after the whole dataset has been predicted/evaluated.
This means that for large evaluation datasets you’ll run out of CUDA memory.
It also makes it difficult to pick a batch size that makes good use of the GPU, as you need to leave room not only for the model and inputs but also for all of the accumulated predictions, which add up when dealing with large evaluation datasets.
- i.e. the problem in the trainer code is that the predictions stay on the GPU here
- These tensors get concatenated but stay on the GPU here
- and then the predictions eventually end up on the CPU here
This means that for larger evaluation datasets all the predictions stay on the GPU, so the memory they consume grows with the length of the evaluation dataset, and you often run out of GPU RAM.
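For illustration only, here is a minimal sketch of that idea (this is not the actual Trainer code; `model`, `eval_dataset` and `collate_fn` are placeholders, and the batches are assumed to be dicts of tensors with a "labels" key). Each batch's logits are moved to the CPU as soon as they are computed, so the GPU only ever has to hold the model and one batch:

```python
import torch
from torch.utils.data import DataLoader

def predict_off_gpu(model, eval_dataset, collate_fn=None, batch_size=32, device="cuda"):
    """Run prediction batch by batch, moving each batch's logits to the CPU
    immediately so accumulated predictions never occupy GPU memory."""
    model.to(device)
    model.eval()
    loader = DataLoader(eval_dataset, batch_size=batch_size, collate_fn=collate_fn)
    all_logits, all_labels = [], []
    with torch.no_grad():
        for batch in loader:
            labels = batch.pop("labels")              # assumes dict-style batches with a "labels" key
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            logits = outputs[0]                       # logits are the first element when no labels are passed
            all_logits.append(logits.detach().cpu())  # move off the GPU right away
            all_labels.append(labels)                 # labels never leave the CPU
    return torch.cat(all_logits), torch.cat(all_labels)
```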
In the meantime, a workaround is to do something like the following and run the loop yourself, predicting chunk by chunk:
```python
import numpy as np
from tqdm import tqdm

step = 10_000  # example chunk size; pick it so one chunk of predictions fits in GPU memory
preds = []
for i in tqdm(range(0, len(df_valid), step)):
    # KeywordDataset, df_valid, targets and tokenizer come from my own code
    ds_chunk = KeywordDataset(tokenizer, df_valid[i:i + step], targets)
    chunk_preds = trainer.predict(ds_chunk)
    preds.append(chunk_preds)
    chunk_accuracy = (chunk_preds.predictions.argmax(-1) == chunk_preds.label_ids).mean()

np_preds = np.concatenate([pred.predictions for pred in preds], axis=0)
np_label_ids = np.concatenate([pred.label_ids for pred in preds], axis=0)
acc = (np_preds.argmax(-1) == np_label_ids).mean()
print('eval_accuracy = ', acc)
```
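Since `trainer.predict()` is called on one chunk at a time, only that chunk's predictions are ever accumulated on the GPU, and the `PredictionOutput` it returns already holds numpy arrays on the CPU, so concatenating the chunks afterwards with numpy costs no GPU memory.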
It would be nice if the aggregation defaulted to the CPU (so you don't have to worry about it), with an additional Trainer argument specifying the device on which to aggregate predictions, so that the behaviour can be overridden if it is important for certain use cases.
Thanks for the amazing work on an amazing library. Working with transformers has never been easier due to the hard work of your team!
Issue Analytics
- Created: 3 years ago
- Reactions: 5
- Comments: 8 (4 by maintainers)
In case that solution is not suitable for someone: like the author, I had the same problem, an out-of-memory error using CUDA, but my dataset was small enough that I could stay on the CPU for training. So I wrote my own training function, converted the dataset to a PyTorch DataLoader, and that was it; this way I can fully control which device is used 😃.

There is a workaround (chunking your dataset into smaller parts), which is why we didn't implement a quick fix. I'll start working on the right fix next week and link the PR here once it's ready @usuyama, so you can see the corresponding part in datasets (I have not yet investigated the exact part in datasets, so I can't answer your question right now; I've just been told it should be possible 😉).
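For anyone who wants to follow the hand-rolled route described in the first comment above, a rough sketch of a CPU-only evaluation loop might look like this (purely illustrative; `evaluate_on_cpu` and the assumption of dict-style batches with a "labels" key are mine, not part of the library):

```python
import torch
from torch.utils.data import DataLoader

def evaluate_on_cpu(model, eval_dataset, batch_size=16):
    """Evaluate entirely on the CPU so GPU memory is never a concern."""
    model.to("cpu")
    model.eval()
    loader = DataLoader(eval_dataset, batch_size=batch_size)
    correct = total = 0
    with torch.no_grad():
        for batch in loader:
            labels = batch.pop("labels")
            logits = model(**batch)[0]  # first output element is the logits
            correct += (logits.argmax(-1) == labels).sum().item()
            total += labels.size(0)
    return correct / total
```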