Out of memory during validation step
I got an OOM error during the validation step. Here is the log:
0/10 * Epoch (train): 100% 32/32 [00:41<00:00, 2.39s/it, _fps=11.486, loss=0.898]
0/10 * Epoch (valid): 3% 1/32 [00:01<00:46, 1.50s/it, _fps=21.531, loss=1.656]
Traceback (most recent call last):
....
RuntimeError: CUDA error: out of memory
My model uses only about 80% of GPU memory during training, yet it runs out of memory in the validation step. That seems strange, since I thought validation consumes less memory than training. I am not sure whether this is normal. My guess is that the GPU memory from training is not released before the validation step starts.
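One way to test that guess is to print the allocator statistics at the boundary between the train and valid loaders. This is a minimal plain-PyTorch sketch (the function name and call site are placeholders, not part of the original code); if the allocated figure stays near the training peak when validation starts, some tensor from training (e.g. a loss kept together with its graph, or accumulated outputs) is still referenced and cannot be freed.

import torch

def log_cuda_memory(tag):
    # Memory held by live tensors vs. memory reserved by PyTorch's caching allocator.
    # memory_reserved() is named memory_cached() on older PyTorch releases.
    allocated_mb = torch.cuda.memory_allocated() / 1024 ** 2
    reserved_mb = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"[{tag}] allocated: {allocated_mb:.1f} MiB, reserved: {reserved_mb:.1f} MiB")

# Call right after the last training batch and right before the first validation batch.
log_cuda_memory("end of train loader")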
I also tried adding a callback that frees GPU memory:
import torch
from catalyst.dl import Callback  # exact import path may differ between Catalyst versions

class FreeGPU(Callback):
    # Release cached GPU memory at every stage/loader/epoch boundary.
    def on_stage_start(self, state):
        torch.cuda.empty_cache()
    def on_loader_start(self, state):
        torch.cuda.empty_cache()
    def on_loader_end(self, state):
        torch.cuda.empty_cache()
    def on_stage_end(self, state):
        torch.cuda.empty_cache()
    def on_epoch_start(self, state):
        torch.cuda.empty_cache()
    def on_epoch_end(self, state):
        torch.cuda.empty_cache()
It does not help at all. Do you have any ideas? P.S. This is not the first time I have run into this problem. The only way I can prevent it is to reduce the batch size, but that hurts performance.
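For what it's worth, torch.cuda.empty_cache() only returns unused cached blocks to the driver; it cannot free tensors that are still referenced, which is why a callback like the one above usually changes nothing. For completeness, this is roughly how such a callback would be attached in a SupervisedRunner-style setup; the model/criterion/optimizer/loaders names are placeholders and the exact train() signature varies between Catalyst releases:

from catalyst.dl import SupervisedRunner

runner = SupervisedRunner()
runner.train(
    model=model,                # assumed to be defined elsewhere
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,            # e.g. {"train": train_loader, "valid": valid_loader}
    callbacks=[FreeGPU()],      # the callback defined above
    num_epochs=10,
)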
Issue Analytics
- Created: 4 years ago
- Comments: 5 (5 by maintainers)
Top Results From Across the Web
- Out of memory error during evaluation but training works fine!
  Surprisingly my old programs are throwing an out of memory error during evaluation (in eval() mode) but training works just fine.
- Out of memory error when using validation while training a ...
  This issue is not a result of the increased training set size. One workaround is to train by splitting the training set into...
- Cuda out of memory during evaluation but training is fine
  Hi, I am finetuning a BARTForConditionalGeneration model. I am using Trainer from the library to train so I do not use anything fancy....
- Cuda out of memory · Discussion #11582 · Lightning ... - GitHub
  Wouldn't it have to be outside the 'training_step' (or validation step) function to keep the running total? Also, similar to using self.log, can...
- GPU out of memory on evaluation : Pytorch - Stack Overflow
  features_all.append((save_features, predicted, targets.data)). This line is saving references to tensors in GPU memory and so the CUDA ... (see the sketch below)
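The Stack Overflow snippet above points at a common culprit for eval-only OOM: appending raw model outputs keeps their GPU storage (and, if grad is enabled, their graphs) alive for the whole loader. A minimal sketch of the usual fix, with hypothetical model/valid_loader names:

features_all = []
for inputs, targets in valid_loader:            # hypothetical DataLoader
    outputs = model(inputs.cuda())
    # Detach from the autograd graph and move to CPU before accumulating;
    # otherwise every appended batch keeps GPU memory pinned until the list is dropped.
    features_all.append((outputs.detach().cpu(), targets))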
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I am using 19.03-rc.

You are right. I hotfixed the library and the error is gone. I think it is time to update my Catalyst. 😢

Which version do you use? 19.03?
P.S. Please check that you have this grad check.
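The "grad check" presumably means making sure autograd is disabled while the valid loader runs; otherwise every validation batch builds and keeps a graph, which would explain running out of memory a few batches into validation. A minimal plain-PyTorch sketch of that check (model, criterion and valid_loader are placeholders):

import torch

model.eval()                       # switch off dropout / BatchNorm updates
valid_loss = 0.0
with torch.no_grad():              # do not build the autograd graph during validation
    for inputs, targets in valid_loader:
        outputs = model(inputs.cuda())
        loss = criterion(outputs, targets.cuda())
        valid_loss += loss.item()  # .item() copies to a Python float; nothing stays on GPU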