
Loss NaN for DeiT Base

See original GitHub issue

I have reproduced the small and tiny models, but ran into problems reproducing the base model at 224 and 384 image sizes. In most runs, the loss became NaN after training for a few epochs. My setup is 16 GPUs with a batch size of 64 per GPU, and I did not change any hyper-parameters in run_with_submitit.py. Do you have any idea how to solve this problem? Thanks for your help.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 6
  • Comments: 24 (3 by maintainers)

Top GitHub Comments

4 reactions
vtddggg commented, Jan 11, 2021

A simple solution for my case: I found that transformer training is sensitive to the learning rate. You must keep a very small learning rate during training (ideally lr < 0.0015); otherwise the gradients become NaN under AMP. So, you can first try reducing the learning rate.
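
For context, a minimal sketch of the linear learning-rate scaling used in the DeiT recipe (assuming the default base lr of 5e-4), showing the effective lr the reported 16 GPU x 64 batch setup ends up with and how you might cap it below the threshold mentioned above:

# Sketch only: assumes DeiT's default base lr (5e-4) and its linear
# batch-size scaling rule (lr * global_batch / 512).
base_lr = 5e-4
batch_per_gpu, num_gpus = 64, 16
global_batch = batch_per_gpu * num_gpus        # 1024 for the setup in the issue

scaled_lr = base_lr * global_batch / 512.0     # 0.001 effective lr
lr = min(scaled_lr, 0.0015)                    # stay under the empirical NaN threshold
print(f"effective lr = {lr}")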

An alternative is to disable AMP. I have tried this as well and it also works for me.
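
As an illustration, here is a generic PyTorch training step where mixed precision can be switched off with a single flag; this is a sketch, not the exact DeiT training loop, and the names are placeholders.

import torch

use_amp = False  # set False to run the whole step in FP32 (slower, but avoids fp16 overflow)

# GradScaler and autocast become no-ops when disabled, so the same code serves both modes.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(model, samples, targets, criterion, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        outputs = model(samples)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()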

3 reactions
Andy1621 commented, Aug 6, 2021

Without AMP, the loss does not become NaN, but training runs much more slowly. I found that the loss becomes NaN inside attention, and simply running attention in FP32 solves the problem. In my experiments, after replacing the attention block with the following code, the model could be resumed normally (sometimes you need to change the random seed).

import torch
import torch.nn as nn


class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
        self.scale = qk_scale or head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)

        # Run attention in FP32 even when AMP is enabled elsewhere; as noted above,
        # this is where the NaNs originate under fp16.
        with torch.cuda.amp.autocast(enabled=False):
            q, k, v = qkv[0].float(), qkv[1].float(), qkv[2].float()   # make torchscript happy (cannot use tensor as tuple)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)
            attn = self.attn_drop(attn)
            x = (attn @ v).transpose(1, 2).reshape(B, N, C)

        x = self.proj(x)
        x = self.proj_drop(x)
        return x

I hope everyone who runs into NaN can try this code and let me know whether it works.
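
As a quick standalone check (the shapes below assume DeiT-Base at 224x224: 196 patch tokens plus the class token), the module can be exercised on its own:

import torch

attn = Attention(dim=768, num_heads=12, qkv_bias=True)
x = torch.randn(2, 197, 768)     # (batch, tokens, embed_dim)
out = attn(x)                    # runs on CPU too: autocast(enabled=False) is a no-op there
print(out.shape)                 # torch.Size([2, 197, 768])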

