Training logs for Swin-B/S/L

Hi authors,

Following the official training command below, I observed unstable training loss and accuracy around epoch 20.

python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py \
    --cfg configs/swin_base_patch4_window7_224.yaml --data-path <imagenet-path> --batch-size 64 \
    --accumulation-steps 2 [--use-checkpoint]

Can you please share the training logs for Swin-B? And if more logs are available, please consider sharing them as well.

TIA!

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
zeliu98 commented, Jan 1, 2022

Hi @lcmeng, the model is trained using the default mixed precision (O1). We don’t handle amp’s logging, so the loss-scaling info is not written to the log file.

Your environment seems correct, and I am not sure what is causing your problem. According to other users’ feedback, installing apex from source can be error-prone, so I suspect there might be something wrong with your apex installation. The nvcr-21.05 Docker image comes with apex preinstalled, so you can try that first. I have checked the CUDA and PyTorch versions and they are fine.

Besides, you can share your log with me so I can look into it further.
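Editor's note: before sharing logs, it can help to record which packages the training environment actually sees. The snippet below is a minimal sketch using only the Python standard library; `describe_package` is a hypothetical helper, not part of Swin or apex, and it only reports whether a package is importable, not whether it was built correctly.

```python
import importlib.util
from importlib import metadata


def describe_package(name):
    """Return the installed version, 'unknown version', or None if not importable.

    Hypothetical helper for illustration only.
    """
    if importlib.util.find_spec(name) is None:
        return None  # the package cannot be imported in this environment
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata was found under this name
        # (this can happen with some source installs).
        return "unknown version"


# Package names relevant to this thread; extend the tuple as needed.
report = {name: describe_package(name) for name in ("torch", "apex")}
for name, version in report.items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Pasting this output next to the training log gives maintainers the same environment picture discussed above.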

0 reactions
lcmeng commented, Jan 28, 2022

@zeliu98, thank you for the explanation. I’ve added some TensorBoard code to Swin to visualize the training. It seems the drop in accuracy near the peak LR is correlated with an explosion of the gradient norm.
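Editor's note: the TensorBoard code itself was not posted in the thread. For readers who want a similar trace, here is a minimal sketch of computing a global gradient norm per step in PyTorch. The toy `nn.Linear` model, the `global_grad_norm` helper, and the tag name in the comment are all assumptions for illustration, not the code used in this issue. Under mixed precision, the norm should be read after the gradients have been unscaled.

```python
import torch
from torch import nn


def global_grad_norm(parameters):
    """L2 norm of all gradients, as if they were concatenated into one vector."""
    grads = [p.grad.detach() for p in parameters if p.grad is not None]
    if not grads:
        return 0.0
    per_param = torch.stack([g.norm(2) for g in grads])
    return per_param.norm(2).item()


torch.manual_seed(0)
model = nn.Linear(4, 2)  # toy stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

history = []
for step in range(3):
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    grad_norm = global_grad_norm(model.parameters())
    history.append(grad_norm)
    # With TensorBoard installed, a torch.utils.tensorboard.SummaryWriter
    # could record the value here, e.g.:
    #   writer.add_scalar("train/grad_norm", grad_norm, step)
    optimizer.step()

print(history)
```

Plotting this value against the global step, alongside the learning rate, makes it easier to see whether an accuracy drop lines up with a spike in the gradient norm.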

Please see the attached screenshots. The LR appears doubled because of accumulation steps = 2; it is in fact the same as the recommended setup.

(1) The trace of gradient norm over global steps. It increased very aggressively after the initial “flat” phase. [Screenshot not preserved]

(2) Max top-1 accuracy happened at Epoch 15. [Screenshot not preserved]

(3) Using the recommended LR schedule. [Screenshot not preserved]
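Editor's note: the remark that the LR only appears doubled can be illustrated with a short worked example. This is a sketch under stated assumptions: a linear LR scaling rule, a base LR of 5e-4, and a reference batch size of 512 are assumed values for illustration, and the 128-per-GPU comparison run is hypothetical. None of these numbers are taken from the thread.

```python
# All constants below are assumptions for illustration only.
BASE_LR = 5e-4          # assumed base learning rate
REFERENCE_BATCH = 512   # assumed batch size the base LR is tuned for


def effective_batch_size(per_gpu_batch, num_gpus, accumulation_steps=1):
    """Number of samples contributing to one optimizer step."""
    return per_gpu_batch * num_gpus * accumulation_steps


def linear_scaled_lr(per_gpu_batch, num_gpus, accumulation_steps=1):
    """Scale the base LR in proportion to the effective batch size."""
    batch = effective_batch_size(per_gpu_batch, num_gpus, accumulation_steps)
    return BASE_LR * batch / REFERENCE_BATCH


# The command in this issue: 8 GPUs, 64 per GPU, 2 accumulation steps.
with_accum = linear_scaled_lr(64, 8, accumulation_steps=2)
# The same run if accumulation were ignored.
without_accum = linear_scaled_lr(64, 8, accumulation_steps=1)
# Hypothetical comparison run: 128 per GPU, no accumulation.
reference = linear_scaled_lr(128, 8, accumulation_steps=1)

print(effective_batch_size(64, 8, 2), with_accum, without_accum, reference)
```

Under these assumptions, the accumulated run has twice the LR of the same run without accumulation, yet matches a run that reaches the same effective batch size directly, which is consistent with the observation in the comment.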
