
CUDA out of memory After 74 epochs

See original GitHub issue

🐛 Bug

I am using PyTorch Lightning to train on 8 V100 GPUs, and I set a seed in my training script. Everything went well at the beginning, but training hit CUDA out of memory during the 74th epoch. This can be reproduced consistently: I have run the experiment 4 times with the same result. The detailed log is posted in the Additional context section.
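
A failure that only appears after dozens of epochs usually points to GPU memory creeping up slowly rather than a single oversized batch. One way to confirm that (a minimal sketch, not taken from this issue; the exact epoch-end hook signature differs slightly between Lightning versions) is to log the allocated CUDA memory at every epoch boundary:

    import torch
    import pytorch_lightning as pl

    class GpuMemoryLogger(pl.Callback):
        # Logs allocated and peak CUDA memory once per epoch to spot slow leaks.
        # Under DDP each process reports its own device.
        def on_train_epoch_end(self, trainer, pl_module, *args, **kwargs):
            # *args/**kwargs keep the override compatible across Lightning versions.
            allocated = torch.cuda.memory_allocated() / 2**20   # MiB held by live tensors
            peak = torch.cuda.max_memory_allocated() / 2**20    # peak MiB since last reset
            print(f"epoch {trainer.current_epoch}: "
                  f"allocated={allocated:.0f} MiB, peak={peak:.0f} MiB")
            torch.cuda.reset_peak_memory_stats()

    # trainer = pl.Trainer(gpus=8, callbacks=[GpuMemoryLogger()])

If the allocated figure climbs steadily from one epoch to the next, something is keeping references to GPU tensors alive (typically losses or outputs stored for logging without being detached).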

Expected behavior

It would be great if you could give me some suggestions! Thank you!

Environment

  • PyTorch Lightning version: 1.3.0
  • PyTorch version: 1.7.1
  • Python version: 3.8
  • OS: Linux
  • CUDA/cuDNN version: 11.0
  • GPU models and configuration: V100 32GB
  • How you installed PyTorch (conda, pip, source): conda

Additional context

[Screenshot: CUDA out-of-memory traceback from the training log]

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7 @borda

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
ZwormZ commented, Apr 27, 2022

Here is the implementation of forward. During training, the parameters are: repr_layers=None, need_head_weights=False, return_contacts=False

    def forward(
        self, tokens, repr_layers=[], need_head_weights=False, return_contacts=False
    ):
        if return_contacts:
            need_head_weights = True

        assert tokens.ndim == 3
        batch_size, num_alignments, seqlen = tokens.size()
        padding_mask = tokens.eq(self.vocab.pad_idx)  # B, R, C
        if not padding_mask.any():
            padding_mask = None
        x = self.embed_tokens(tokens.long())
        # x = self.embed_tokens(tokens)
        x += self.embed_positions(
            tokens.view(batch_size * num_alignments, seqlen)
        ).view(x.size())
        if self.msa_position_embedding is not None:
            if x.size(1) > 1024:
                raise RuntimeError(
                    "Using model with MSA position embedding trained on maximum MSA "
                    f"depth of 1024, but received {x.size(1)} alignments."
                )
            x += self.msa_position_embedding[:, :num_alignments]

        x = self.emb_layer_norm_before(x)

        x = self.dropout_module(x)

        if padding_mask is not None:
            x = x * (1 - padding_mask.unsqueeze(-1).type_as(x))

        repr_layers = set(repr_layers)
        hidden_representations = {}
        if 0 in repr_layers:
            hidden_representations[0] = x

        if need_head_weights:
            row_attn_weights = []
            col_attn_weights = []

        # B x R x C x D -> R x C x B x D
        x = x.permute(1, 2, 0, 3)

        for layer_idx, layer in enumerate(self.layers):
            x = layer(
                x,
                self_attn_padding_mask=padding_mask,
                need_head_weights=need_head_weights,
            )
            if need_head_weights:
                x, col_attn, row_attn = x
                # H x C x B x R x R -> B x H x C x R x R
                col_attn_weights.append(col_attn.permute(2, 0, 1, 3, 4))
                # H x B x C x C -> B x H x C x C
                row_attn_weights.append(row_attn.permute(1, 0, 2, 3))
            if (layer_idx + 1) in repr_layers:
                hidden_representations[layer_idx + 1] = x.permute(2, 0, 1, 3)

        x = self.emb_layer_norm_after(x)
        x = x.permute(2, 0, 1, 3)  # R x C x B x D -> B x R x C x D

        # last hidden representation should have layer norm applied
        if (layer_idx + 1) in repr_layers:
            hidden_representations[layer_idx + 1] = x
        x = self.lm_head(x)

        result = {"logits": x, "representations": hidden_representations}
        if need_head_weights:
            # col_attentions: B x L x H x C x R x R
            col_attentions = torch.stack(col_attn_weights, 1)
            # row_attentions: B x L x H x C x C
            row_attentions = torch.stack(row_attn_weights, 1)
            result["col_attentions"] = col_attentions
            result["row_attentions"] = row_attentions
            if return_contacts:
                contacts = self.contact_head(tokens, row_attentions)
                result["contacts"] = contacts

        return result
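
One memory-related property of this forward is worth calling out: the returned dict holds large tensors that are still attached to the autograd graph (the hidden_representations entries, plus the stacked attention maps when need_head_weights is set). If the surrounding loop accumulates these results across steps or epochs without detaching them, GPU memory grows steadily. A hypothetical caller-side sketch (the names model and loader are illustrative, not from the issue):

    import torch

    def collect_outputs(model, loader):
        # Hypothetical helper: run forward() over a loader and keep the results
        # without pinning GPU memory or the autograd graph.
        outputs = []
        with torch.no_grad():  # no computation graph is built or retained
            for tokens in loader:
                result = model(tokens, need_head_weights=True)
                outputs.append({
                    "logits": result["logits"].cpu(),              # move off the GPU
                    "row_attentions": result["row_attentions"].cpu(),
                })
        return outputs

During training, where torch.no_grad() is not an option, calling .detach() on anything stored for later inspection serves the same purpose.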

0 reactions
Huni-ML commented, Nov 7, 2022

@yuhongyu0721 @ZwormZ Do you have a full script and env details that reproduce the behaviour?

I figured out that my problem was caused by PyTorch and its compilation. It was fixed by reinstalling PyTorch in a clean environment.
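
After a reinstall like that, a quick sanity check from Python confirms the new build matches the driver and CUDA toolkit (nothing here is specific to this issue):

    import torch

    print(torch.__version__)                  # e.g. 1.7.1
    print(torch.version.cuda)                 # CUDA version the build was compiled against
    print(torch.backends.cudnn.version())     # bundled cuDNN version
    print(torch.cuda.is_available())          # True if the driver and runtime line up
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # e.g. "Tesla V100-SXM2-32GB"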

Read more comments on GitHub >

Top Results From Across the Web

  • CUDA out of memory After 74 epochs · Issue #12874 - GitHub
    I am using the pytorch-lighting to run on 8 V100GPUs, and I set a seed in my train script, everything went well at beginning, ...
  • 'CUDA error: out of memory' after several epochs
    The strange thing is that this error arises after 7 epochs, so it seems like some GPU memory allocation is not being released. ...
  • CUDA out of memory only during validation not training
    I'm trying to fine-tune an AraBERT2GPT2 model using the EncoderDecoderModel class on a relatively small dataset. I train only for 1 epoch and ...
  • CUDA out of memory error training after a few epochs
    Hi, I'm having some memory errors when training a GCN model on a gpu, the model runs fine for about 25 epochs and ...
  • RuntimeError: CUDA out of memory in training with pytorch ...
    1 Answer ... Your GPU doesn't have enough memory. Try to reduce the batch size. If still the same, try to reduce ...
