Loss NaN for DeiT Base
I have reproduced the small and tiny models, but I ran into problems reproducing the base model at the 224 and 384 image sizes: with high probability, the loss becomes NaN after a few epochs of training.
My setup is 16 GPUs with a batch size of 64 per GPU, and I did not change any hyper-parameters in run_with_submitit.py. Do you have any idea how to solve this problem? Thanks for your help.
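For reference, a launch matching this setup would look roughly like the line below; the flag names follow the DeiT repository's README, so treat the exact invocation as an assumption:

```bash
# Hypothetical launch: 2 nodes x 8 GPUs = 16 GPUs, batch size 64 per GPU,
# all other hyper-parameters left at their defaults.
python run_with_submitit.py --model deit_base_patch16_224 \
    --batch-size 64 --nodes 2 --ngpus 8 --data-path /path/to/imagenet
```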
Issue Analytics
- Created: 3 years ago
- Reactions: 6
- Comments: 24 (3 by maintainers)
Top Results From Across the Web

NaN loss when training regression network - Stack Overflow
In my case, I use the log value of density estimation as an input. The absolute value could be very huge, which may...

Training Stronger Vision Transformers Calls for Reducing All ... (arXiv:2203.06345)
We disable the repeated augmentation in DeiT-Base's [54] training schemes due to the well-known loss NAN issue of the original imple-...

A complete Hugging Face tutorial: how to build and train a ...
Learn about the Hugging Face ecosystem with a hands-on tutorial on the datasets and transformers library. Explore how to fine-tune a Vision ...
Top GitHub Comments
A simple solution for my case: I found that transformer training is sensitive to the learning rate. You must keep a very small learning rate during training (ideally lr < 0.0015); otherwise the gradient will become NaN, caused by amp. So you can first try reducing the learning rate. An alternative is to disable amp; I have tried this and it also works for me.
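As a minimal sketch of what disabling amp amounts to in a generic PyTorch training step (the tiny model, data, and optimizer below are stand-ins, not the DeiT training code):

```python
import torch
import torch.nn as nn

# Stand-ins for the real model and data; DeiT-Base would take their place.
model = nn.Linear(10, 2).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # keep lr small

use_amp = False  # the suggested fix: turn mixed precision off entirely
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

images = torch.randn(8, 10, device="cuda")
targets = torch.randint(0, 2, (8,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_amp):  # plain fp32 when disabled
    loss = criterion(model(images), targets)
scaler.scale(loss).backward()  # scaling is a no-op when the scaler is disabled
scaler.step(optimizer)
scaler.update()
```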
Without AMP, the loss will not become NaN, but training runs very slowly. I have found that the loss becomes NaN inside attention, and simply using FP32 for attention solves the problem. In my experiments, after replacing the attention block with the code below, the model could resume training normally (sometimes you will need to change the random seed).
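The code block from this comment was not preserved in the page capture. As a reconstruction under that caveat, here is a minimal sketch of a force-FP32 attention module in the style of timm's ViT `Attention` (which DeiT uses); the class name and defaults are illustrative, not the commenter's exact code:

```python
import torch
import torch.nn as nn

class AttentionFP32(nn.Module):
    """timm-style ViT attention with the attention math forced to fp32.

    Under AMP the surrounding network can stay in fp16; only the
    numerically sensitive part (logits, softmax, weighted sum) is upcast.
    """

    def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape
        qkv = (self.qkv(x)
               .reshape(B, N, 3, self.num_heads, C // self.num_heads)
               .permute(2, 0, 3, 1, 4))  # (3, B, heads, N, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]

        # Leave any enclosing autocast region and compute attention in fp32.
        with torch.cuda.amp.autocast(enabled=False):
            q, k, v = q.float(), k.float(), v.float()
            attn = (q @ k.transpose(-2, -1)) * self.scale
            attn = self.attn_drop(attn.softmax(dim=-1))
            x = (attn @ v).transpose(1, 2).reshape(B, N, C)

        return self.proj_drop(self.proj(x))
```

One way to apply such a module would be to substitute it for `timm.models.vision_transformer.Attention` before the model is built; the rest of the network can stay under AMP, since only the softmax path is pinned to fp32.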
I hope everyone who runs into NaN can try this code and let me know if it works.