Out of memory during validation step

I got an OOM error during the validation step. Here is the log:

0/10 * Epoch (train): 100% 32/32 [00:41<00:00,  2.39s/it, _fps=11.486, loss=0.898] 
0/10 * Epoch (valid):   3% 1/32 [00:01<00:46,  1.50s/it, _fps=21.531, loss=1.656]
Traceback (most recent call last):
....
RuntimeError: CUDA error: out of memory

My model uses only 80% of GPU memory during training. However, during the validation step it runs out of memory. That is weird, since I thought validation consumes less memory than training. I am not sure whether this is normal, but my guess is that the GPU memory is not released before the validation step starts.

I also tried adding a callback to free GPU memory:

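import torch
# `Callback` is Catalyst's base callback class; import it from the path your Catalyst version provides.


class FreeGPU(Callback):
    """Release unused cached GPU memory at every stage, epoch, and loader boundary."""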

    def on_stage_start(self, state):
        torch.cuda.empty_cache()

    def on_loader_start(self, state):
        torch.cuda.empty_cache()

    def on_loader_end(self, state):
        torch.cuda.empty_cache()

    def on_stage_end(self, state):
        torch.cuda.empty_cache()

    def on_epoch_start(self, state):
        torch.cuda.empty_cache()

    def on_epoch_end(self, state):
        torch.cuda.empty_cache()

It does not help at all. Do you have any ideas? P.S. This is not the first time I have faced this problem. The only way I have found to prevent it is to reduce the batch size, but that hurts performance.
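
A note on why the callback above may not help, offered as a minimal sketch rather than a diagnosis of this exact run: `torch.cuda.empty_cache()` only returns cached blocks that no tensor is using, so memory still referenced by live tensors (for example, activations kept for a backward pass) stays allocated. The helper names `gpu_memory_report` and `held_after_empty_cache` and the 4 MiB tensor are hypothetical, made up for illustration. Without a CUDA device the sketch falls back to the CPU and reports zeros.

```python
import torch


def gpu_memory_report():
    """Return (allocated, reserved) bytes on the current CUDA device, or (0, 0) without CUDA."""
    if not torch.cuda.is_available():
        return 0, 0
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


def held_after_empty_cache():
    """Allocate a tensor, call empty_cache(), and report memory before and after."""
    before = gpu_memory_report()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Hypothetical 4 MiB float32 tensor standing in for activations kept alive by autograd.
    live = torch.zeros(1024, 1024, device=device)
    # Frees only cached blocks that no tensor occupies; `live` is untouched.
    torch.cuda.empty_cache()
    after = gpu_memory_report()
    return before, after, live


before, after, live = held_after_empty_cache()
print("allocated before:", before[0], "allocated after:", after[0])
```

On a GPU, the allocated figure should stay higher after the call for as long as `live` is referenced, which is consistent with the callback making no difference when something else is holding the memory.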

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

ngxbac commented, Apr 15, 2019

> Which version do you use? 19.03?

I am using 19.03-rc.

> PS. please, check that you have this grad check

You are right. I hotfixed the library and the error is gone. I think it is time to update my Catalyst 😢

Scitator commented, Apr 14, 2019

Which version do you use? 19.03?

PS. please, check that you have this grad check
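
The link behind "this grad check" was not preserved here, so the following is only an assumption about what it refers to: switching gradient tracking off whenever the current loader is not a training loader. A minimal sketch of that idea in plain PyTorch is below. The model, batch, and `run_batch` helper are hypothetical stand-ins, not Catalyst code.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the real model, batch, and loss.
model = nn.Linear(16, 4)
batch = torch.randn(8, 16)
targets = torch.randint(0, 4, (8,))
criterion = nn.CrossEntropyLoss()


def run_batch(model, batch, targets, is_train):
    """Forward pass that tracks gradients only when training."""
    model.train(is_train)
    # Without this guard, each validation forward pass keeps its activations
    # alive for a backward pass that never runs.
    with torch.set_grad_enabled(is_train):
        logits = model(batch)
        loss = criterion(logits, targets)
    return loss


train_loss = run_batch(model, batch, targets, is_train=True)
valid_loss = run_batch(model, batch, targets, is_train=False)

print(train_loss.requires_grad)  # True
print(valid_loss.requires_grad)  # False
```

If validation runs with gradient tracking left on, every batch retains its autograd graph until the outputs are released, which could explain validation needing more memory than training.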
