Training logs for Swin-B/S/L
Hi authors,
Following the official training command below, I observed unstable training loss and accuracy around epoch 20.
```
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py \
  --cfg configs/swin_base_patch4_window7_224.yaml --data-path <imagenet-path> --batch-size 64 \
  --accumulation-steps 2 [--use-checkpoint]
```
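For context, the effective batch size this command implies, and the linearly scaled LR it leads to, can be worked out as below. This is a sketch: the base LR of 5e-4 and the batch-512 linear scaling rule are assumptions about the repo's defaults, not something stated in this thread.

```python
# Illustrative arithmetic only; base_lr and base_batch are assumed defaults.
gpus = 8            # --nproc_per_node 8
batch_per_gpu = 64  # --batch-size 64
accum_steps = 2     # --accumulation-steps 2
base_lr = 5e-4      # assumed default base LR
base_batch = 512    # assumed reference batch size for linear scaling

effective_batch = gpus * batch_per_gpu * accum_steps  # 8 * 64 * 2 = 1024
scaled_lr = base_lr * effective_batch / base_batch    # 1e-3, i.e. "doubled"
print(effective_batch, scaled_lr)
```

Under these assumptions the effective batch is 1024, so the scaled LR is twice the base LR, which matches the "doubled LR" observation later in the thread.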
Can you please share the training logs for Swin-B? And if more logs are available, please consider sharing them as well.
TIA!
Issue Analytics
- Created: 2 years ago
- Comments: 11 (4 by maintainers)
Top GitHub Comments
Hi @lcmeng, the model is trained using the default mixed precision (O1). We don't handle amp's logging, so the loss-scaling info is not written to the log file.
Your environment seems correct, and I am not sure what is causing your problem. According to other users' feedback, installing apex from source can be error-prone, so I suspect there might be something wrong with your apex installation. The nvcr-21.05 Docker image ships with apex pre-installed, so you can try that first. The CUDA and PyTorch versions are fine; I have checked them.
Also, please share your log with me so I can look into it further.
@zeliu98, thank you for the explanation. I've added some TensorBoard code to Swin to visualize the training. The drop in accuracy near the peak LR appears to be correlated with an explosion of the gradient norm.
Please see the attached screenshots. The LR appears doubled because accumulation-steps = 2; it is in fact equivalent to the recommended setup.
(1) The trace of the gradient norm over global steps: it increases very aggressively after the initial "flat" phase.
(2) The max top-1 accuracy occurred at epoch 15.
(3) The recommended LR schedule was used.
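For readers who want to reproduce this kind of plot, a minimal sketch of logging a global gradient norm follows. This is not the code actually added to Swin in the comment above; `writer` (a `torch.utils.tensorboard.SummaryWriter`) and `global_step` are assumed names from a surrounding training loop.

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients; call after loss.backward()."""
    total_sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_sq += p.grad.detach().float().norm(2).item() ** 2
    return total_sq ** 0.5

# Inside the training loop (illustrative):
#   loss.backward()
#   writer.add_scalar("train/grad_norm", global_grad_norm(model), global_step)
#   optimizer.step()
```

Note that with amp (O1), gradients may be scaled by the loss scaler at the point of measurement, so the logged norm should be interpreted accordingly (or measured after unscaling).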