Out of memory during validation step
I got an OOM error during the validation step. Here is the log:
0/10 * Epoch (train): 100% 32/32 [00:41<00:00, 2.39s/it, _fps=11.486, loss=0.898]
0/10 * Epoch (valid): 3% 1/32 [00:01<00:46, 1.50s/it, _fps=21.531, loss=1.656]
Traceback (most recent call last):
....
RuntimeError: CUDA error: out of memory
My model uses only about 80% of GPU memory during training, yet it runs out of memory in the validation step. That seems strange, since I thought validation consumes less memory than training. I am not sure whether this is normal. My guess is that the GPU memory from training is not released before the validation step starts.
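One way to test that guess is to print the allocator statistics at the boundary between the train and valid loaders. This is a minimal plain-PyTorch sketch (the function name and call site are placeholders, not part of the original code); if the allocated figure stays near the training peak when validation starts, some tensor from training (e.g. a loss kept together with its graph, or accumulated outputs) is still referenced and cannot be freed.

import torch

def log_cuda_memory(tag):
    # Memory held by live tensors vs. memory reserved by PyTorch's caching allocator.
    # memory_reserved() is named memory_cached() on older PyTorch releases.
    allocated_mb = torch.cuda.memory_allocated() / 1024 ** 2
    reserved_mb = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"[{tag}] allocated: {allocated_mb:.1f} MiB, reserved: {reserved_mb:.1f} MiB")

# Call right after the last training batch and right before the first validation batch.
log_cuda_memory("end of train loader")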
I also tried adding a callback that frees GPU memory:
import torch
from catalyst.dl import Callback  # exact import path may differ between Catalyst versions

class FreeGPU(Callback):
    # Release cached GPU memory at every stage/loader/epoch boundary.
    def on_stage_start(self, state):
        torch.cuda.empty_cache()
    def on_loader_start(self, state):
        torch.cuda.empty_cache()
    def on_loader_end(self, state):
        torch.cuda.empty_cache()
    def on_stage_end(self, state):
        torch.cuda.empty_cache()
    def on_epoch_start(self, state):
        torch.cuda.empty_cache()
    def on_epoch_end(self, state):
        torch.cuda.empty_cache()
It does not help at all. Do you have any ideas? P.S. This is not the first time I have run into this problem. The only way I can prevent it is to reduce the batch size, but that hurts performance.
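For what it's worth, torch.cuda.empty_cache() only returns unused cached blocks to the driver; it cannot free tensors that are still referenced, which is why a callback like the one above usually changes nothing. For completeness, this is roughly how such a callback would be attached in a SupervisedRunner-style setup; the model/criterion/optimizer/loaders names are placeholders and the exact train() signature varies between Catalyst releases:

from catalyst.dl import SupervisedRunner

runner = SupervisedRunner()
runner.train(
    model=model,                # assumed to be defined elsewhere
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,            # e.g. {"train": train_loader, "valid": valid_loader}
    callbacks=[FreeGPU()],      # the callback defined above
    num_epochs=10,
)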
Issue Analytics
- Created: 4 years ago
- Comments: 5 (5 by maintainers)
Top Results From Across the Web
- Out of memory error during evaluation but training works fine!
  Surprisingly my old programs are throwing an out of memory error during evaluation (in eval() mode) but training works just fine.
- Out of memory error when using validation while training a ...
  This issue is not a result of the increased training set size. One workaround is to train by splitting the training set into...
- Cuda out of memory during evaluation but training is fine
  Hi, I am finetuning a BARTForConditionalGeneration model. I am using Trainer from the library to train so I do not use anything fancy....
- Cuda out of memory · Discussion #11582 · Lightning ... - GitHub
  Wouldn't it have to be outside the 'training_step' (or validation step) function to keep the running total? Also, similar to using self.log, can...
- GPU out of memory on evaluation : Pytorch - Stack Overflow
  features_all.append((save_features, predicted, targets.data)). This line is saving references to tensors in GPU memory and so the CUDA ... (see the sketch below)
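The Stack Overflow snippet above points at a common culprit for eval-only OOM: appending raw model outputs keeps their GPU storage (and, if grad is enabled, their graphs) alive for the whole loader. A minimal sketch of the usual fix, with hypothetical model/valid_loader names:

features_all = []
for inputs, targets in valid_loader:            # hypothetical DataLoader
    outputs = model(inputs.cuda())
    # Detach from the autograd graph and move to CPU before accumulating;
    # otherwise every appended batch keeps GPU memory pinned until the list is dropped.
    features_all.append((outputs.detach().cpu(), targets))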
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I am using 19.03-rc.

You are right. I hotfixed the library and the error is gone. I think it is time to update my Catalyst. 😢

Which version do you use? 19.03?
P.S. Please check that you have this grad check.
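The "grad check" presumably means making sure autograd is disabled while the valid loader runs; otherwise every validation batch builds and keeps a graph, which would explain running out of memory a few batches into validation. A minimal plain-PyTorch sketch of that check (model, criterion and valid_loader are placeholders):

import torch

model.eval()                       # switch off dropout / BatchNorm updates
valid_loss = 0.0
with torch.no_grad():              # do not build the autograd graph during validation
    for inputs, targets in valid_loader:
        outputs = model(inputs.cuda())
        loss = criterion(outputs, targets.cuda())
        valid_loss += loss.item()  # .item() copies to a Python float; nothing stays on GPU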