
Unable to Replicate ViT+WebFace42M model results


I tried to replicate the results for the ViT-Base model trained on WebFace42M, but the model does not seem to converge: the loss starts at 53 and stagnates at about 22 after a few epochs of training. I used the exact same config, with the maximum learning rate scaled according to my batch size. I am training on 4 GPUs, and the config variables are as follows:

```python
config.network = "vit_b"
config.embedding_size = 256

# Partial FC
config.sample_rate = 1

config.fp16 = True
config.batch_size = 1500

# For AdamW
config.optimizer = "adamw"
config.lr = 0.00025
config.weight_decay = 0.1

config.verbose = 1415
config.dali = False

config.rec = "/media/data/Webface42M_rec"
config.num_classes = 2059906
config.num_epoch = 40
config.warmup_epoch = config.num_epoch // 10
```
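For context, the linear scaling rule commonly used to adjust the maximum learning rate with the total batch size is sketched below. The reference values are placeholders taken from the 8-GPU log later in this thread, not the repository's official ViT-B settings, so treat them as assumptions:

```python
# Minimal sketch of linear LR scaling: lr is proportional to the total batch size.
# base_lr and base_batch are assumed reference values, not the official ViT-B config.
def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Scale the maximum learning rate linearly with the total batch size."""
    return base_lr * new_batch / base_batch

new_total_batch = 4 * 1500  # 4 GPUs x 1500 images per GPU = 6000
print(scale_lr(base_lr=0.001, base_batch=4096, new_batch=new_total_batch))
# ~0.00146 under these assumed reference values, i.e. higher than the 0.00025 used above
```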

Do you have any insights into why this could be happening?

Any help would be highly appreciated, @anxiangsir.

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 17 (6 by maintainers)

Top GitHub Comments

3 reactions
anxiangsir commented, Jul 4, 2022

Hi jacqueline-weng, I will train ViT-T tonight with the latest code to check whether it can be reproduced.

Training: 2022-07-04 19:49:59,615-: margin_list              [1.0, 0.0, 0.4]
Training: 2022-07-04 19:49:59,616-: network                  vit_t_dp005_mask0
Training: 2022-07-04 19:49:59,616-: resume                   False
Training: 2022-07-04 19:49:59,616-: save_all_states          False
Training: 2022-07-04 19:49:59,616-: output                   work_dirs/wf42m_pfc03_40epoch_8gpu_vit_t
Training: 2022-07-04 19:49:59,616-: embedding_size           512
Training: 2022-07-04 19:49:59,616-: sample_rate              0.3
Training: 2022-07-04 19:49:59,616-: interclass_filtering_threshold 0
Training: 2022-07-04 19:49:59,616-: fp16                     True
Training: 2022-07-04 19:49:59,616-: batch_size               512
Training: 2022-07-04 19:49:59,616-: optimizer                adamw
Training: 2022-07-04 19:49:59,616-: lr                       0.001
Training: 2022-07-04 19:49:59,616-: momentum                 0.9
Training: 2022-07-04 19:49:59,616-: weight_decay             0.1
Training: 2022-07-04 19:49:59,616-: verbose                  2000
Training: 2022-07-04 19:49:59,616-: frequent                 10
Training: 2022-07-04 19:49:59,617-: dali                     True
Training: 2022-07-04 19:49:59,617-: seed                     2048
Training: 2022-07-04 19:49:59,617-: num_workers              2
Training: 2022-07-04 19:49:59,617-: rec                      /train_tmp/WebFace42M
Training: 2022-07-04 19:49:59,617-: num_classes              2059906
Training: 2022-07-04 19:49:59,617-: num_image                42474557
Training: 2022-07-04 19:49:59,617-: num_epoch                40
Training: 2022-07-04 19:49:59,617-: warmup_epoch             4
Training: 2022-07-04 19:49:59,617-: val_targets              []
Training: 2022-07-04 19:49:59,617-: total_batch_size         4096
Training: 2022-07-04 19:49:59,617-: warmup_step              41476
Training: 2022-07-04 19:49:59,617-: total_step               414760
Training: 2022-07-04 19:50:02,872-Reducer buckets have been rebuilt in this iteration.
Training: 2022-07-04 19:50:11,077-Speed 8975.71 samples/sec   Loss 42.8994   LearningRate 0.000000   Epoch: 0   Global Step: 20   Fp16 Grad Scale: 65536   Required: 60 hours
Training: 2022-07-04 19:50:15,617-Speed 9025.80 samples/sec   Loss 42.8919   LearningRate 0.000001   Epoch: 0   Global Step: 30   Fp16 Grad Scale: 65536   Required: 56 hours
Training: 2022-07-04 19:50:20,164-Speed 9008.30 samples/sec   Loss 42.8912   LearningRate 0.000001   Epoch: 0   Global Step: 40   Fp16 Grad Scale: 65536   Required: 56 hours
Training: 2022-07-04 19:50:24,703-Speed 9027.74 samples/sec   Loss 42.8725   LearningRate 0.000001   Epoch: 0   Global Step: 50   Fp16 Grad Scale: 65536   Required: 56 hours
Training: 2022-07-04 19:50:29,265-Speed 8980.16 samples/sec   Loss 42.8666   LearningRate 0.000001   Epoch: 0   Global Step: 60   Fp16 Grad Scale: 65536   Required: 55 hours
Training: 2022-07-04 19:50:33,809-Speed 9018.63 samples/sec   Loss 42.8707   LearningRate 0.000002   Epoch: 0   Global Step: 70   Fp16 Grad Scale: 65536   Required: 55 hours
Training: 2022-07-04 19:50:38,347-Speed 9026.76 samples/sec   Loss 42.8631   LearningRate 0.000002   Epoch: 0   Global Step: 80   Fp16 Grad Scale: 65536   Required: 54 hours
Training: 2022-07-04 19:50:42,909-Speed 8980.33 samples/sec   Loss 42.7995   LearningRate 0.000002   Epoch: 0   Global Step: 90   Fp16 Grad Scale: 65536   Required: 54 hours
Training: 2022-07-04 19:50:47,477-Speed 8969.62 samples/sec   Loss 42.7913   LearningRate 0.000002   Epoch: 0   Global Step: 100   Fp16 Grad Scale: 131072   Required: 54 hours
Training: 2022-07-04 19:50:52,046-Speed 8966.48 samples/sec   Loss 42.7773   LearningRate 0.000003   Epoch: 0   Global Step: 110   Fp16 Grad Scale: 131072   Required: 54 hours
Training: 2022-07-04 19:50:56,611-Speed 8973.38 samples/sec   Loss 42.7429   LearningRate 0.000003   Epoch: 0   Global Step: 120   Fp16 Grad Scale: 131072   Required: 53 hours
Training: 2022-07-04 19:51:01,163-Speed 9001.72 samples/sec   Loss 42.6983   LearningRate 0.000003   Epoch: 0   Global Step: 130   Fp16 Grad Scale: 131072   Required: 54 hours
Training: 2022-07-04 19:51:05,709-Speed 9011.11 samples/sec   Loss 42.6910   LearningRate 0.000003   Epoch: 0   Global Step: 140   Fp16 Grad Scale: 131072   Required: 54 hours
Training: 2022-07-04 19:51:10,250-Speed 9023.66 samples/sec   Loss 42.6167   LearningRate 0.000004   Epoch: 0   Global Step: 150   Fp16 Grad Scale: 131072   Required: 53 hours
Training: 2022-07-04 19:51:14,790-Speed 9025.02 samples/sec   Loss 42.5821   LearningRate 0.000004   Epoch: 0   Global Step: 160   Fp16 Grad Scale: 131072   Required: 54 hours
Training: 2022-07-04 19:51:19,324-Speed 9035.34 samples/sec   Loss 42.5213   LearningRate 0.000004   Epoch: 0   Global Step: 170   Fp16 Grad Scale: 131072   Required: 53 hours
Training: 2022-07-04 19:51:23,878-Speed 8997.40 samples/sec   Loss 42.4732   LearningRate 0.000004   Epoch: 0   Global Step: 180   Fp16 Grad Scale: 131072   Required: 53 hours
Training: 2022-07-04 19:51:28,417-Speed 9024.36 samples/sec   Loss 42.4071   LearningRate 0.000005   Epoch: 0   Global Step: 190   Fp16 Grad Scale: 131072   Required: 53 hours
Training: 2022-07-04 19:51:32,965-Speed 9008.65 samples/sec   Loss 42.3038   LearningRate 0.000005   Epoch: 0   Global Step: 200   Fp16 Grad Scale: 262144   Required: 53 hours
Training: 2022-07-04 19:51:37,510-Speed 9014.58 samples/sec   Loss 42.2272   LearningRate 0.000005   Epoch: 0   Global Step: 210   Fp16 Grad Scale: 262144   Required: 53 hours
Training: 2022-07-04 19:51:42,061-Speed 9001.34 samples/sec   Loss 42.1281   LearningRate 0.000005   Epoch: 0   Global Step: 220   Fp16 Grad Scale: 262144   Required: 53 hours
Training: 2022-07-04 19:51:46,620-Speed 8987.33 samples/sec   Loss 42.0046   LearningRate 0.000006   Epoch: 0   Global Step: 230   Fp16 Grad Scale: 262144   Required: 53 hours

This is my server config: 8 × 32GB V100 GPUs.
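As a sanity check, the warmup_step and total_step values printed in the log follow directly from the other settings; this is a small reconstruction of that arithmetic, not code from the repository:

```python
# Reconstruct the step counts shown in the log above.
num_image = 42474557
total_batch_size = 4096  # 8 GPUs x 512 images per GPU
num_epoch = 40
warmup_epoch = 4

steps_per_epoch = num_image // total_batch_size   # 10369
warmup_step = steps_per_epoch * warmup_epoch      # 41476, matches the log
total_step = steps_per_epoch * num_epoch          # 414760, matches the log
print(steps_per_epoch, warmup_step, total_step)
```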

2 reactions
abdikaiym01 commented, Jul 6, 2022

Hi @anxiangsir, I noticed that when training with your framework (recognition/arcface_torch), the loss jumps after each epoch. What do you think could be causing this? Maybe the DistributedSampler (utils.utils_distributed_sampler) is not working correctly?
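One common cause of loss jumps at epoch boundaries is a distributed sampler that is never reseeded, so every epoch replays the same shuffling order. Below is a minimal sketch using PyTorch's built-in torch.utils.data.distributed.DistributedSampler; whether the repository's utils.utils_distributed_sampler behaves this way is an assumption, and dataset, train_step, and num_epoch are placeholders:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# `dataset`, `train_step`, and `num_epoch` are placeholders for the real
# training dataset, per-batch training function, and epoch count.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=512, sampler=sampler, num_workers=2)

for epoch in range(num_epoch):
    # Without this call the sampler reuses the same permutation every epoch,
    # which can show up as step-like changes in the loss at epoch boundaries.
    sampler.set_epoch(epoch)
    for images, labels in loader:
        train_step(images, labels)
```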

