
CUDA out of memory After 74 epochs

See original GitHub issue

🐛 Bug

I am using PyTorch Lightning to train on 8 V100 GPUs, and I set a seed in my training script. Everything went well at the beginning, but training hit CUDA out of memory during the 74th epoch. This can be reproduced consistently: I have run the experiment 4 times with the same result. The detailed log is posted in the Additional context section.
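
A failure that only appears after dozens of epochs usually points to GPU memory creeping up slowly rather than a single oversized batch. One way to confirm that (a minimal sketch, not taken from this issue; the exact epoch-end hook signature differs slightly between Lightning versions) is to log the allocated CUDA memory at every epoch boundary:

    import torch
    import pytorch_lightning as pl

    class GpuMemoryLogger(pl.Callback):
        # Logs allocated and peak CUDA memory once per epoch to spot slow leaks.
        # Under DDP each process reports its own device.
        def on_train_epoch_end(self, trainer, pl_module, *args, **kwargs):
            # *args/**kwargs keep the override compatible across Lightning versions.
            allocated = torch.cuda.memory_allocated() / 2**20   # MiB held by live tensors
            peak = torch.cuda.max_memory_allocated() / 2**20    # peak MiB since last reset
            print(f"epoch {trainer.current_epoch}: "
                  f"allocated={allocated:.0f} MiB, peak={peak:.0f} MiB")
            torch.cuda.reset_peak_memory_stats()

    # trainer = pl.Trainer(gpus=8, callbacks=[GpuMemoryLogger()])

If the allocated figure climbs steadily from one epoch to the next, something is keeping references to GPU tensors alive (typically losses or outputs stored for logging without being detached).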

Expected behavior

It would be great if you could give me some suggestions! Thank you!

Environment

  • PyTorch Lightning version: 1.3.0
  • PyTorch version: 1.7.1
  • Python version: 3.8
  • OS: Linux
  • CUDA/cuDNN version: 11.0
  • GPU models and configuration: V100 32GB
  • How you installed PyTorch (conda, pip, source): conda

Additional context

[Screenshot: CUDA out-of-memory traceback from the training log]

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7 @borda

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
ZwormZ commented, Apr 27, 2022

Here is the implementation of forward. During training, the parameters are: repr_layers=None, need_head_weights=False, return_contacts=False

    def forward(
        self, tokens, repr_layers=[], need_head_weights=False, return_contacts=False
    ):
        if return_contacts:
            need_head_weights = True

        assert tokens.ndim == 3
        batch_size, num_alignments, seqlen = tokens.size()
        padding_mask = tokens.eq(self.vocab.pad_idx)  # B, R, C
        if not padding_mask.any():
            padding_mask = None
        x = self.embed_tokens(tokens.long())
        # x = self.embed_tokens(tokens)
        x += self.embed_positions(
            tokens.view(batch_size * num_alignments, seqlen)
        ).view(x.size())
        if self.msa_position_embedding is not None:
            if x.size(1) > 1024:
                raise RuntimeError(
                    "Using model with MSA position embedding trained on maximum MSA "
                    f"depth of 1024, but received {x.size(1)} alignments."
                )
            x += self.msa_position_embedding[:, :num_alignments]

        x = self.emb_layer_norm_before(x)

        x = self.dropout_module(x)

        if padding_mask is not None:
            x = x * (1 - padding_mask.unsqueeze(-1).type_as(x))

        repr_layers = set(repr_layers)
        hidden_representations = {}
        if 0 in repr_layers:
            hidden_representations[0] = x

        if need_head_weights:
            row_attn_weights = []
            col_attn_weights = []

        # B x R x C x D -> R x C x B x D
        x = x.permute(1, 2, 0, 3)

        for layer_idx, layer in enumerate(self.layers):
            x = layer(
                x,
                self_attn_padding_mask=padding_mask,
                need_head_weights=need_head_weights,
            )
            if need_head_weights:
                x, col_attn, row_attn = x
                # H x C x B x R x R -> B x H x C x R x R
                col_attn_weights.append(col_attn.permute(2, 0, 1, 3, 4))
                # H x B x C x C -> B x H x C x C
                row_attn_weights.append(row_attn.permute(1, 0, 2, 3))
            if (layer_idx + 1) in repr_layers:
                hidden_representations[layer_idx + 1] = x.permute(2, 0, 1, 3)

        x = self.emb_layer_norm_after(x)
        x = x.permute(2, 0, 1, 3)  # R x C x B x D -> B x R x C x D

        # last hidden representation should have layer norm applied
        if (layer_idx + 1) in repr_layers:
            hidden_representations[layer_idx + 1] = x
        x = self.lm_head(x)

        result = {"logits": x, "representations": hidden_representations}
        if need_head_weights:
            # col_attentions: B x L x H x C x R x R
            col_attentions = torch.stack(col_attn_weights, 1)
            # row_attentions: B x L x H x C x C
            row_attentions = torch.stack(row_attn_weights, 1)
            result["col_attentions"] = col_attentions
            result["row_attentions"] = row_attentions
            if return_contacts:
                contacts = self.contact_head(tokens, row_attentions)
                result["contacts"] = contacts

        return result
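
One memory-related property of this forward is worth calling out: the returned dict holds large tensors that are still attached to the autograd graph (the hidden_representations entries, plus the stacked attention maps when need_head_weights is set). If the surrounding loop accumulates these results across steps or epochs without detaching them, GPU memory grows steadily. A hypothetical caller-side sketch (the names model and loader are illustrative, not from the issue):

    import torch

    def collect_outputs(model, loader):
        # Hypothetical helper: run forward() over a loader and keep the results
        # without pinning GPU memory or the autograd graph.
        outputs = []
        with torch.no_grad():  # no computation graph is built or retained
            for tokens in loader:
                result = model(tokens, need_head_weights=True)
                outputs.append({
                    "logits": result["logits"].cpu(),              # move off the GPU
                    "row_attentions": result["row_attentions"].cpu(),
                })
        return outputs

During training, where torch.no_grad() is not an option, calling .detach() on anything stored for later inspection serves the same purpose.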

0 reactions
Huni-ML commented, Nov 7, 2022

@yuhongyu0721 @ZwormZ Do you have a full script and env details that reproduce the behaviour?

I figured out that my problem was caused by PyTorch and its compilation. It was fixed by reinstalling PyTorch in a clean environment.
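
After a reinstall like that, a quick sanity check from Python confirms the new build matches the driver and CUDA toolkit (nothing here is specific to this issue):

    import torch

    print(torch.__version__)                  # e.g. 1.7.1
    print(torch.version.cuda)                 # CUDA version the build was compiled against
    print(torch.backends.cudnn.version())     # bundled cuDNN version
    print(torch.cuda.is_available())          # True if the driver and runtime line up
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # e.g. "Tesla V100-SXM2-32GB"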

Read more comments on GitHub >

Top Results From Across the Web

  • CUDA out of memory After 74 epochs · Issue #12874 - GitHub
    I am using the pytorch-lighting to run on 8 V100GPUs, and I set a seed in my train script, everything went well at beginning, ...
  • 'CUDA error: out of memory' after several epochs
    The strange thing is that this error arises after 7 epochs, so it seems like some GPU memory allocation is not being released. ...
  • CUDA out of memory only during validation not training
    I'm trying to fine-tune an AraBERT2GPT2 model using the EncoderDecoderModel class on a relatively small dataset. I train only for 1 epoch and ...
  • CUDA out of memory error training after a few epochs
    Hi, I'm having some memory errors when training a GCN model on a gpu, the model runs fine for about 25 epochs and ...
  • RuntimeError: CUDA out of memory in training with pytorch ...
    1 Answer ... Your GPU doesn't have enough memory. Try to reduce the batch size. If still the same, try to reduce ...
