
OOM when implementing training_epoch_end

See original GitHub issue

Bug description

See the toy example:

import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, Dataset


class SomeModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 2)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def forward(self):
        batch_size, features = 10240, 24000
        x_hat = torch.randn(batch_size, features, requires_grad=True, device=self.device)
        x = torch.randn(batch_size, features, device=self.device)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return {
            "loss": loss,
            # With training_epoch_end overridden, this whole dict (including the
            # large tensor `x` and the loss's autograd graph) is kept for every batch.
            "x": x,
        }

    def training_step(self, batch, batch_idx):
        return self()

    def training_epoch_end(self, outputs) -> None:
        # do nothing
        ...


class SomeDataset(Dataset):
    def __len__(self):
        return 100000000

    def __getitem__(self, index):
        return "some sample"


train = SomeDataset()
val = SomeDataset()

model = SomeModel()
trainer = pl.Trainer(devices=1, accelerator="gpu", check_val_every_n_epoch=1)
trainer.fit(model, DataLoader(train), DataLoader(val))

Run it, and we get an OOM error.

How to reproduce the bug

See the toy example above.

Error messages and logs

RuntimeError: CUDA out of memory.

Environment

PyTorch Lightning Version: 1.7.7

More info

How to fix it: ~~the bug comes from the `advance` method of `training_epoch_loop`.~~ WRONG

~~see https://github.com/Lightning-AI/lightning/blob/1.7.7/src/pytorch_lightning/loops/epoch/training_epoch_loop.py#L213~~

When we save the `batch_end_outputs` for later use, we should detach every tensor in the output.
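
A rough sketch of what that proposed change could look like (the `detach_tensors` helper below is purely illustrative and not part of Lightning; as the maintainers explain in the comments below, Lightning intentionally does not detach these outputs):

import torch


def detach_tensors(output):
    # Recursively detach any tensors found in a training_step output so that
    # storing the output does not keep the autograd graph alive.
    if isinstance(output, torch.Tensor):
        return output.detach()
    if isinstance(output, dict):
        return {key: detach_tensors(value) for key, value in output.items()}
    if isinstance(output, (list, tuple)):
        return type(output)(detach_tensors(value) for value in output)
    return output


# Hypothetically, the epoch loop would then detach before storing:
# batch_end_outputs = detach_tensors(batch_end_outputs)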

cc @carmocca @justusschock @rohitgr7

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 15 (13 by maintainers)

Top GitHub Comments

2 reactions
carmocca commented, Nov 4, 2022

This is expected. When you override training_epoch_end, we store all batch outputs for the hook.

We don’t want to detach because the user might want the grads.

If you don’t want to do this but still want an epoch-end hook, use on_train_epoch_end, which does not have this behaviour.
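
Applied to the toy model above, that suggestion might look roughly like the sketch below (the class name is invented for illustration; it drops the training_epoch_end override, returns only the loss from training_step, and uses the argument-free hook instead):

import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torch import nn


class SomeModelNoAccumulation(pl.LightningModule):
    # Same toy model, but without a training_epoch_end override, so Lightning
    # does not keep per-batch outputs around for the whole epoch.

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 2)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def training_step(self, batch, batch_idx):
        batch_size, features = 10240, 24000
        x_hat = torch.randn(batch_size, features, requires_grad=True, device=self.device)
        x = torch.randn(batch_size, features, device=self.device)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss  # return only the loss instead of a dict holding `x`

    def on_train_epoch_end(self) -> None:
        # Runs once per epoch and receives no `outputs` argument,
        # so nothing accumulates in GPU memory across the epoch.
        ...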

1 reaction
WrRan commented, Dec 6, 2022

So what about checking the signature of the overridden `training_epoch_end`? When users do not need the outputs, skip L227.
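
One possible reading of this suggestion, sketched with Python's standard inspect module (illustrative only; this is not how Lightning's loop actually works):

import inspect


def override_wants_outputs(model) -> bool:
    # Hypothetical check: does the user's training_epoch_end override declare
    # an `outputs` parameter? If not, the loop could skip storing per-batch
    # outputs. (Lightning does not actually do this.)
    parameters = inspect.signature(model.training_epoch_end).parameters
    return len(parameters) > 0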

Actually, I use `training_epoch_end` to do some logging, as follows:

def training_epoch_end(self, outputs: EPOCH_OUTPUT) -> None:
    self.log("train_acc", self.train_accuracy.compute(), sync_dist=True)
    self.log("train_bleu_1", self.train_bleu_1.compute(), sync_dist=True)
    self.log("train_bleu_2", self.train_bleu_2.compute(), sync_dist=True)
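
Since this logging never touches `outputs`, it could be moved to the hook suggested above (a sketch, assuming the same `train_accuracy` / `train_bleu_*` metric attributes exist on the module):

def on_train_epoch_end(self) -> None:
    # Same epoch-level logging, via the hook that does not trigger
    # per-batch output accumulation.
    self.log("train_acc", self.train_accuracy.compute(), sync_dist=True)
    self.log("train_bleu_1", self.train_bleu_1.compute(), sync_dist=True)
    self.log("train_bleu_2", self.train_bleu_2.compute(), sync_dist=True)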