Loss NaN for DeiT Base
I have reproduced the small and tiny models, but I ran into problems reproducing the base model at the 224 and 384 image sizes: with high probability, the loss becomes NaN after a few epochs of training.
My setup is 16 GPUs with a batch size of 64 per GPU, and I did not change any hyper-parameters in run_with_submitit.py. Do you have any idea how to solve this problem? Thanks for your help.
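For reference, a launch matching this setup would look roughly like the line below; the flag names follow the DeiT repository's README, so treat the exact invocation as an assumption:

```bash
# Hypothetical launch: 2 nodes x 8 GPUs = 16 GPUs, batch size 64 per GPU,
# all other hyper-parameters left at their defaults.
python run_with_submitit.py --model deit_base_patch16_224 \
    --batch-size 64 --nodes 2 --ngpus 8 --data-path /path/to/imagenet
```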
Issue Analytics
- Created: 3 years ago
- Reactions: 6
- Comments: 24 (3 by maintainers)
Top Results From Across the Web

NaN loss when training regression network - Stack Overflow
In my case, I use the log value of density estimation as an input. The absolute value could be very huge, which may...

Training Stronger Vision Transformers Calls for Reducing All ... (arXiv:2203.06345)
We disable the repeated augmentation in DeiT-Base's [54] training schemes due to the well-known loss NAN issue of the original imple-...

A complete Hugging Face tutorial: how to build and train a ...
Learn about the Hugging Face ecosystem with a hands-on tutorial on the datasets and transformers library. Explore how to fine-tune a Vision ...
Top GitHub Comments
A simple solution for my case: I found that transformer training is sensitive to the learning rate. You must keep a very small learning rate during training (ideally lr < 0.0015); otherwise the gradient will become NaN, caused by amp. So you can first try reducing the learning rate. An alternative is to disable amp; I have tried this and it also works for me.
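As a minimal sketch of what disabling amp amounts to in a generic PyTorch training step (the tiny model, data, and optimizer below are stand-ins, not the DeiT training code):

```python
import torch
import torch.nn as nn

# Stand-ins for the real model and data; DeiT-Base would take their place.
model = nn.Linear(10, 2).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # keep lr small

use_amp = False  # the suggested fix: turn mixed precision off entirely
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

images = torch.randn(8, 10, device="cuda")
targets = torch.randint(0, 2, (8,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_amp):  # plain fp32 when disabled
    loss = criterion(model(images), targets)
scaler.scale(loss).backward()  # scaling is a no-op when the scaler is disabled
scaler.step(optimizer)
scaler.update()
```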
Without AMP, the loss will not become NaN, but training runs very slowly. I have found that the loss becomes NaN inside attention, and simply using FP32 for attention solves the problem. In my experiments, after replacing the attention block with the code below, the model could resume training normally (sometimes you will need to change the random seed).
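The code block from this comment was not preserved in the page capture. As a reconstruction under that caveat, here is a minimal sketch of a force-FP32 attention module in the style of timm's ViT `Attention` (which DeiT uses); the class name and defaults are illustrative, not the commenter's exact code:

```python
import torch
import torch.nn as nn

class AttentionFP32(nn.Module):
    """timm-style ViT attention with the attention math forced to fp32.

    Under AMP the surrounding network can stay in fp16; only the
    numerically sensitive part (logits, softmax, weighted sum) is upcast.
    """

    def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape
        qkv = (self.qkv(x)
               .reshape(B, N, 3, self.num_heads, C // self.num_heads)
               .permute(2, 0, 3, 1, 4))  # (3, B, heads, N, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]

        # Leave any enclosing autocast region and compute attention in fp32.
        with torch.cuda.amp.autocast(enabled=False):
            q, k, v = q.float(), k.float(), v.float()
            attn = (q @ k.transpose(-2, -1)) * self.scale
            attn = self.attn_drop(attn.softmax(dim=-1))
            x = (attn @ v).transpose(1, 2).reshape(B, N, C)

        return self.proj_drop(self.proj(x))
```

One way to apply such a module would be to substitute it for `timm.models.vision_transformer.Attention` before the model is built; the rest of the network can stay under AMP, since only the softmax path is pinned to fp32.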
I hope everyone who runs into NaN can try this code and let me know if it works.