
trainer.evaluate() aggregates predictions on GPU and causes CUDA out of memory issues for large datasets

See original GitHub issue

Environment info

  • transformers version: 3.1.0
  • Platform: Linux-4.15.0-112-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

Trainer: @sgugger

Information

Model I am using (Bert, XLNet …): Bert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Use the Trainer for evaluation (.evaluate(), .predict()) on the GPU with BERT and an evaluation dataset large enough that the returned prediction tensors plus the model exceed GPU RAM. (In my case the evaluation dataset had 469,530 sentences.) A minimal sketch of this setup follows the list.
  2. The Trainer crashes with a CUDA out-of-memory error.
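
A minimal sketch of such a setup, assuming a BERT sequence-classification model (the SentenceDataset class, sizes, and names below are placeholders, not the code from the original report):

    import torch
    from torch.utils.data import Dataset
    from transformers import (BertForSequenceClassification, BertTokenizer,
                              Trainer, TrainingArguments)

    class SentenceDataset(Dataset):
        """Tokenizes a list of sentences up front; purely illustrative."""
        def __init__(self, tokenizer, sentences, labels):
            self.enc = tokenizer(sentences, truncation=True, padding="max_length",
                                 max_length=64, return_tensors="pt")
            self.labels = torch.tensor(labels)

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            item = {k: v[idx] for k, v in self.enc.items()}
            item["labels"] = self.labels[idx]
            return item

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

    # Hundreds of thousands of evaluation examples: each batch fits on the GPU,
    # but the accumulated prediction tensors do not.
    eval_dataset = SentenceDataset(tokenizer, ["some sentence"] * 469_530, [0] * 469_530)

    trainer = Trainer(model=model,
                      args=TrainingArguments(output_dir="out", per_device_eval_batch_size=64))
    predictions = trainer.predict(eval_dataset)   # CUDA out of memory on large eval sets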

Expected behavior

  • I would expect the predictions from each step of predict() or evaluate() to be moved to the CPU (off the GPU) and concatenated there later. Instead, the tensors are concatenated while still on the GPU, and are only converted to CPU numpy arrays after the whole dataset has been predicted/evaluated.

This means that for large evaluation datasets you’ll run out of CUDA memory.

It also makes it hard to pick a batch size that makes full use of the GPU, because you need to leave room not only for the model and its inputs but also for all of the accumulated predictions, which add up quickly with large evaluation datasets.

  • i.e. the problem in the Trainer code is that the predictions stay on the GPU here,
  • these tensors get concatenated but stay on the GPU here,
  • and the predictions only end up on the CPU here (the “here” references are links to the relevant lines in the original GitHub issue).

In practice this means that all of the predictions accumulate on the GPU, the memory they consume grows with the length of the evaluation dataset, and for large datasets you run out of GPU RAM. The sketch below illustrates the difference between aggregating on the GPU and offloading each batch to the CPU.
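
To make that concrete, here is a plain-PyTorch sketch (illustrative only, not the Trainer’s actual code) of the two aggregation strategies; with offload_to_cpu=False the GPU holds every batch’s logits at once, which is essentially what the Trainer does today:

    import torch

    @torch.no_grad()
    def predict_all(model, dataloader, device, offload_to_cpu=True):
        """Run inference batch by batch; the dataloader yields dicts of model inputs."""
        model.eval()
        all_logits = []
        for batch in dataloader:
            batch.pop("labels", None)                       # keep only the model inputs
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(**batch)[0]                      # first output = classification logits
            if offload_to_cpu:
                logits = logits.cpu()                       # free GPU memory before the next batch
            all_logits.append(logits)
        # With offload_to_cpu=False, all_logits (and this concatenation) stay on the GPU,
        # so peak memory grows with the length of the dataset, which is the behaviour described above.
        return torch.cat(all_logits, dim=0)

With offload_to_cpu=True, GPU usage stays roughly constant no matter how long the evaluation dataset is, since only one batch of logits is ever resident on the device.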

A workaround is to do something like the following and run the loop yourself, predicting chunk by chunk:

    import numpy as np
    from tqdm import tqdm

    preds = []
    # Walk over the validation dataframe in chunks of `step` rows so that only one
    # chunk's predictions are produced per call to trainer.predict().
    for i in tqdm(range(0, len(df_valid), step)):
        ds_chunk = KeywordDataset(tokenizer, df_valid[i:i + step], targets)
        batch_preds = trainer.predict(ds_chunk)   # predictions come back as CPU numpy arrays
        preds.append(batch_preds)
        # Optional: per-chunk accuracy, handy for monitoring long runs.
        batch_accuracy = (batch_preds.predictions.argmax(-1) == batch_preds.label_ids).mean()

    # Stitch the per-chunk results back together on the CPU.
    np_preds = np.concatenate([pred.predictions for pred in preds], axis=0)
    np_label_ids = np.concatenate([pred.label_ids for pred in preds], axis=0)
    acc = (np_preds.argmax(-1) == np_label_ids).mean()
    print('eval_accuracy = ', acc)

It would be nice if aggregation defaulted to the CPU (so you don’t have to worry about it), with an additional Trainer argument to choose the device on which predictions are aggregated, so the behaviour can be overridden for use cases where it matters.
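
For what it’s worth, later transformers releases added a training argument along these lines: eval_accumulation_steps on TrainingArguments moves the accumulated prediction tensors to the CPU every N evaluation steps instead of keeping everything on the GPU until the end (it is not available in 3.1.0, the version this issue was filed against). A rough sketch, reusing the model and eval_dataset from the reproduction snippet above:

    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir="out",
        per_device_eval_batch_size=64,
        eval_accumulation_steps=20,   # offload accumulated logits to the CPU every 20 steps
    )
    trainer = Trainer(model=model, args=args, eval_dataset=eval_dataset)
    metrics = trainer.evaluate()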

Thanks for the amazing work on an amazing library. Working with transformers has never been easier due to the hard work of your team!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 5
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

5 reactions
Godwinh19 commented, May 12, 2022

In case this solution isn’t suitable for someone in my situation: I had the same problem, a CUDA out-of-memory error, but my dataset was small enough to train on the CPU. So I wrote my own training function, wrapped the dataset in a PyTorch DataLoader, and that was it. That way I have full control over which processor is used 😃.
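
A rough sketch of that kind of manual CPU loop (not the commenter’s actual code; the dataset, model, and hyperparameters are placeholders):

    import torch
    from torch.utils.data import DataLoader

    # train_dataset yields dicts with input tensors and a "labels" entry;
    # model is e.g. a BertForSequenceClassification instance. Nothing is moved to CUDA.
    loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    for epoch in range(3):
        for batch in loader:
            optimizer.zero_grad()
            loss = model(**batch)[0]   # the loss is the first output when labels are passed
            loss.backward()
            optimizer.step()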

4 reactions
sgugger commented, Oct 11, 2020

There is a workaround (chunking your dataset into smaller parts), which is why we didn’t implement a quick fix. I’ll start working on the right fix next week and will link the PR here once it’s ready, @usuyama, so you can see the corresponding part in datasets. (I haven’t yet investigated the exact part in datasets, so I can’t answer your question right now; I’ve just been told it should be possible 😉)
