
vargfacenet + arcloss infinite loss

See original GitHub issue

I am trying to retrain VarGFaceNet as in the LFR challenge, but training reports an infinite (NaN) loss immediately. Can someone please help? For reference, I train with a smaller batch size and only one GPU, but I doubt that is the problem. Log:

CUDA_VISIBLE_DEVICES='0' python -u train.py --network vargfacenet --loss arcface --dataset retina
gpu num: 1
prefix ./models/vargfacenet-arcface-retina/model
image_size [112, 112]
num_classes 93431
Called with argument: Namespace(batch_size=32, ckpt=3, ctx_num=1, dataset='retina', frequent=20, image_channel=3, kvstore='device', loss='arcface', lr=0.1, lr_steps='100000,160000,220000', models_root='./models', mom=0.9, network='vargfacenet', per_batch_size=32, pretrained='', pretrained_epoch=1, rescale_threshold=0, verbose=2000, wd=0.0005) {'bn_mom': 0.9, 'workspace': 256, 'emb_size': 512, 'ckpt_embedding': True, 'net_se': 0, 'net_act': 'prelu', 'net_unit': 3, 'net_input': 1, 'net_blocks': [1, 4, 6, 2], 'net_output': 'J', 'net_multiplier': 1.25, 'val_targets': ['lfw', 'cfp_fp', 'agedb_30'], 'ce_loss': True, 'fc7_lr_mult': 1.0, 'fc7_wd_mult': 1.0, 'fc7_no_bias': False, 'max_steps': 0, 'data_rand_mirror': True, 'data_cutoff': False, 'data_color': 0, 'data_images_filter': 0, 'count_flops': True, 'memonger': False, 'loss_name': 'margin_softmax', 'loss_s': 64.0, 'loss_m1': 1.0, 'loss_m2': 0.5, 'loss_m3': 0.0, 'net_name': 'vargfacenet', 'dataset': 'retina', 'dataset_path': '../datasets/ms1m-retinaface-t1', 'num_classes': 93431, 'image_shape': [112, 112, 3], 'loss': 'arcface', 'network': 'vargfacenet', 'num_workers': 1, 'batch_size': 32, 'per_batch_size': 32}
Network FLOPs: 1.0G
INFO:root:loading recordio ../datasets/ms1m-retinaface-t1/train.rec...
header0 label [5179511. 5272942.]
id2range 93431
5179510
rand_mirror True
[13:21:19] src/engine/engine.cc:55: MXNet start using engine: ThreadedEnginePerDevice
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver lfw
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
(14000, 3, 112, 112)
ver cfp_fp
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver agedb_30
lr_steps [100000, 160000, 220000]
call reset()
/home/vdx/csenv/lib/python3.7/site-packages/mxnet/module/base_module.py:504: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.03125). Is this intended?
  optimizer_params=optimizer_params)
[13:22:01] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [0-20]	Speed: 62.45 samples/sec	acc=0.000000	lossvalue=nan
INFO:root:Epoch[0] Batch [20-40]	Speed: 61.56 samples/sec	acc=0.000000	lossvalue=nan
INFO:root:Epoch[0] Batch [40-60]	Speed: 58.71 samples/sec	acc=0.000000	lossvalue=nan
INFO:root:Epoch[0] Batch [60-80]	Speed: 44.77 samples/sec	acc=0.000000	lossvalue=nan
INFO:root:Epoch[0] Batch [80-100]	Speed: 74.33 samples/sec	acc=0.000000	lossvalue=nan
INFO:root:Epoch[0] Batch [100-120]	Speed: 26.56 samples/sec	acc=0.000000	lossvalue=nan

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
doxuanviet1996 commented, Dec 3, 2019

Go to the LFR challenge page; they link to the retina dataset.

1 reaction
chenghan1995 commented, Dec 3, 2019

Thank you for your instant reply. I will try immediately.


Top Results From Across the Web

ArcFace: Additive Angular Margin Loss for Deep Face ... - arXiv
In this paper, we first introduce an Additive Angular Margin Loss (ArcFace), which not only has a clear geometric interpretation but also significantly...

Face Recognition and ArcFace: Additive Angular Margin Loss ...
In this article, you will discover an ArcFace approach, which obtains highly discriminative features for face recognition.

Labeled Faces in the Wild Benchmark (Face Verification)
Rank  Model                     Accuracy  Year
1     VarGFaceNet               99.85%    2019
2     ArcFace + MS1MV2 + R100   99.83%    2018
3     PFEfuse+match             99.82%    2019

How to Choose a Loss Function For Face Recognition
The first half of this article describes loss functions that provide fine-grained control over these two sub-tasks. Unlike a generic classification task, it's ...

VarGFaceNet: An Efficient Variable Group Convolutional ...
To enhance interpretation ability, we employ an equivalence of angular distillation loss to guide our lightweight network and we apply recursive knowledge.
