Unable to Replicate ViT+WebFace42M model results
I tried to replicate the results for the ViT-Base model trained on WebFace42M, but the model does not seem to converge. The loss starts at 53 and stagnates at about 22 after a few epochs of training. I used the exact same config, with the maximum learning rate scaled according to my batch size. I am training on 4 GPUs, and the config variables are as follows:
```python
config.network = "vit_b"
config.embedding_size = 256

# Partial FC
config.sample_rate = 1
config.fp16 = True
config.batch_size = 1500

# For AdamW
config.optimizer = "adamw"
config.lr = 0.00025
config.weight_decay = 0.1

config.verbose = 1415
config.dali = False

config.rec = "/media/data/Webface42M_rec"
config.num_classes = 2059906
config.num_epoch = 40
config.warmup_epoch = config.num_epoch // 10
```
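For reference, a common convention when adapting a published config to a different batch size is to scale the learning rate linearly with the total (global) batch size. The helper and the base values below are illustrative assumptions, not taken from the repo:

```python
# Hypothetical helper illustrating the linear LR scaling rule (lr ∝ total
# batch size). base_lr and base_total_batch are assumed reference values,
# not the ones used in the original arcface_torch config.
def scaled_lr(base_lr: float, base_total_batch: int,
              batch_per_gpu: int, num_gpus: int) -> float:
    total_batch = batch_per_gpu * num_gpus
    return base_lr * total_batch / base_total_batch

# Example: an assumed base lr of 0.001 at total batch 4096,
# rescaled for 4 GPUs at 256 samples each (total batch 1024):
lr = scaled_lr(0.001, 4096, 256, 4)
print(lr)  # 0.00025
```

If the learning rate was instead copied unscaled, or scaled against the wrong reference batch size, a stalled loss like the one described above is a plausible symptom.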
Do you have any insights on why this could be happening?
Any help would be highly appreciated @anxiangsir
Issue Analytics
- State:
- Created a year ago
- Comments: 17 (6 by maintainers)
Top GitHub Comments
Hi @jacqueline-weng, I will train ViT-T tonight with the latest code to check whether the results can be reproduced.
This is my server config: 8 * 32GB V100.
Hi @anxiangsir, I noticed while training with your framework (recognition/arcface_torch) that the loss jumps after each epoch. What do you think could cause this? Could it be that DistributedSampler (utils.utils_distributed_sampler) is not working correctly?
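One thing worth checking in this situation: with PyTorch's standard `torch.utils.data.DistributedSampler`, the shuffle order only changes between epochs if `set_epoch()` is called; otherwise every epoch replays the identical sample order, which can produce loss discontinuities at epoch boundaries. Whether this applies to the repo's custom `utils.utils_distributed_sampler` is an assumption; the sketch below uses the standard PyTorch API only:

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Minimal single-process sketch (num_replicas/rank passed explicitly so no
# process group is needed). DistributedSampler seeds its shuffle with
# seed + epoch, so forgetting set_epoch() repeats the same order forever.
dataset = TensorDataset(torch.arange(8).float())
sampler = DistributedSampler(dataset, num_replicas=1, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # required for a fresh shuffle each epoch
    for (batch,) in loader:
        pass  # training step would go here
```

If the training loop never advances the sampler's epoch, that is one plausible source of the per-epoch loss jumps you are seeing.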