OOM when implementing training_epoch_end
Bug description
See the toy example:
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, Dataset


class SomeModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 2)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def forward(self):
        batch_size, features = 10240, 24000
        x_hat = torch.randn(batch_size, features, requires_grad=True, device=self.device)
        x = torch.randn(batch_size, features, device=self.device)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        # The extra tensor returned alongside the loss is kept for
        # training_epoch_end; this is what blows up GPU memory.
        return {
            "loss": loss,
            "x": x,
        }

    def training_step(self, batch, batch_idx):
        return self()

    def training_epoch_end(self, outputs) -> None:
        # do nothing
        ...


class SomeDataset(Dataset):
    def __len__(self):
        return 100000000

    def __getitem__(self, index):
        return "some sample"


train = SomeDataset()
val = SomeDataset()
model = SomeModel()
trainer = pl.Trainer(devices=1, accelerator="gpu", check_val_every_n_epoch=1)
trainer.fit(model, DataLoader(train), DataLoader(val))
Run it and you will get an OOM error: the output dict from every training step keeps a reference to x (10240 × 24000 float32 values, roughly 0.9 GiB), and these outputs are accumulated for training_epoch_end, so GPU memory grows every step.
How to reproduce the bug
See the code above.
Error messages and logs
RuntimeError: CUDA out of memory.
Environment
PyTorch Lightning version: 1.7.7
More info
How to fix it:
The bug comes from the advance method of the training_epoch_loop: when we save the batch_end_outputs for later use, we should detach every tensor in the output.
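That is the library-side change the author is proposing. Until then, a user-side workaround in the same spirit (my sketch, not code from the issue) is to detach everything except the loss before returning it from training_step in the toy model above:

    def training_step(self, batch, batch_idx):
        out = self()
        # Keep the autograd graph only on "loss" (Lightning still needs it
        # for backward); detach every other tensor so the outputs stored for
        # training_epoch_end no longer hold large activations.
        return {k: (v if k == "loss" else v.detach()) for k, v in out.items()}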
This is expected. When you override training_epoch_end, we store all batch outputs for the hook. We don't want to detach them because the user might want the grads.
If you don't want to do this but still want an epoch-end hook, use on_train_epoch_end, which does not have this behaviour.
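For illustration, here is a sketch of that alternative (my own example, reusing the imports from the repro above, not code from the thread): aggregate the values you care about yourself and read them in on_train_epoch_end, so Lightning never stores per-batch outputs.

    class ToyModelNoStoredOutputs(pl.LightningModule):
        """Same toy setup, but without a training_epoch_end override."""

        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(1, 2)
            self.epoch_losses = []  # only small, detached scalars are kept

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

        def training_step(self, batch, batch_idx):
            batch_size, features = 10240, 24000
            x_hat = torch.randn(batch_size, features, requires_grad=True, device=self.device)
            x = torch.randn(batch_size, features, device=self.device)
            loss = F.mse_loss(x_hat, x)
            self.epoch_losses.append(loss.detach())
            return loss

        def on_train_epoch_end(self):
            # Because training_epoch_end is not overridden, Lightning does not
            # accumulate batch outputs; we aggregate our own detached values.
            self.log("train_loss_epoch", torch.stack(self.epoch_losses).mean())
            self.epoch_losses.clear()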
So what about checking the signature of the overridden training_epoch_end? When users do not need the outputs, skip L227.
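As a rough illustration of that suggestion (a sketch only; the helper name and whether the loop could actually use such a check are my assumptions), the override's signature could be inspected before deciding to keep the batch outputs:

    import inspect

    def override_declares_outputs(model: pl.LightningModule) -> bool:
        # Hypothetical helper: return True only if the user's
        # training_epoch_end override declares an `outputs` parameter,
        # i.e. actually wants the stored batch outputs.
        params = inspect.signature(model.training_epoch_end).parameters
        return "outputs" in params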
Actually, I use training_epoch_end to do some logging, as follows: