
vargfacenet + arcloss infinite loss

See original GitHub issue

I am trying to retrain VarGFaceNet as in the LFR challenge, but training reports an infinite (NaN) loss immediately. Can someone please help? For reference, I train with a smaller batch size and only one GPU, but I doubt that is the problem. Log:

CUDA_VISIBLE_DEVICES='0' python -u train.py --network vargfacenet --loss arcface --dataset retina
gpu num: 1
prefix ./models/vargfacenet-arcface-retina/model
image_size [112, 112]
num_classes 93431
Called with argument: Namespace(batch_size=32, ckpt=3, ctx_num=1, dataset='retina', frequent=20, image_channel=3, kvstore='device', loss='arcface', lr=0.1, lr_steps='100000,160000,220000', models_root='./models', mom=0.9, network='vargfacenet', per_batch_size=32, pretrained='', pretrained_epoch=1, rescale_threshold=0, verbose=2000, wd=0.0005) {'bn_mom': 0.9, 'workspace': 256, 'emb_size': 512, 'ckpt_embedding': True, 'net_se': 0, 'net_act': 'prelu', 'net_unit': 3, 'net_input': 1, 'net_blocks': [1, 4, 6, 2], 'net_output': 'J', 'net_multiplier': 1.25, 'val_targets': ['lfw', 'cfp_fp', 'agedb_30'], 'ce_loss': True, 'fc7_lr_mult': 1.0, 'fc7_wd_mult': 1.0, 'fc7_no_bias': False, 'max_steps': 0, 'data_rand_mirror': True, 'data_cutoff': False, 'data_color': 0, 'data_images_filter': 0, 'count_flops': True, 'memonger': False, 'loss_name': 'margin_softmax', 'loss_s': 64.0, 'loss_m1': 1.0, 'loss_m2': 0.5, 'loss_m3': 0.0, 'net_name': 'vargfacenet', 'dataset': 'retina', 'dataset_path': '../datasets/ms1m-retinaface-t1', 'num_classes': 93431, 'image_shape': [112, 112, 3], 'loss': 'arcface', 'network': 'vargfacenet', 'num_workers': 1, 'batch_size': 32, 'per_batch_size': 32}
Network FLOPs: 1.0G
INFO:root:loading recordio ../datasets/ms1m-retinaface-t1/train.rec...
header0 label [5179511. 5272942.]
id2range 93431
5179510
rand_mirror True
[13:21:19] src/engine/engine.cc:55: MXNet start using engine: ThreadedEnginePerDevice
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver lfw
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
(14000, 3, 112, 112)
ver cfp_fp
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
(12000, 3, 112, 112)
ver agedb_30
lr_steps [100000, 160000, 220000]
call reset()
/home/vdx/csenv/lib/python3.7/site-packages/mxnet/module/base_module.py:504: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.03125). Is this intended?
  optimizer_params=optimizer_params)
[13:22:01] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [0-20]	Speed: 62.45 samples/sec	acc=0.000000	lossvalue=nan
INFO:root:Epoch[0] Batch [20-40]	Speed: 61.56 samples/sec	acc=0.000000	lossvalue=nan
INFO:root:Epoch[0] Batch [40-60]	Speed: 58.71 samples/sec	acc=0.000000	lossvalue=nan
INFO:root:Epoch[0] Batch [60-80]	Speed: 44.77 samples/sec	acc=0.000000	lossvalue=nan
INFO:root:Epoch[0] Batch [80-100]	Speed: 74.33 samples/sec	acc=0.000000	lossvalue=nan
INFO:root:Epoch[0] Batch [100-120]	Speed: 26.56 samples/sec	acc=0.000000	lossvalue=nan

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
doxuanviet1996 commented, Dec 3, 2019

Go to the LFR challenge page; they link to the retina dataset.

1 reaction
chenghan1995 commented, Dec 3, 2019

Thank you for your instant reply. I will try immediately.


Top Results From Across the Web

ArcFace: Additive Angular Margin Loss for Deep Face ... - arXiv
In this paper, we first introduce an Additive Angular Margin Loss (ArcFace), which not only has a clear geometric interpretation but also significantly...

Face Recognition and ArcFace: Additive Angular Margin Loss ...
In this article, you will discover an ArcFace approach, which obtains highly discriminative features for face recognition.

Labeled Faces in the Wild Benchmark (Face Verification)
Rank  Model                     Accuracy  Year
1     VarGFaceNet               99.85%    2019
2     ArcFace + MS1MV2 + R100   99.83%    2018
3     PFEfuse+match             99.82%    2019

How to Choose a Loss Function For Face Recognition
The first half of this article describes loss functions that provide fine-grained control over these two sub-tasks. Unlike a generic classification task, it's ...

VarGFaceNet: An Efficient Variable Group Convolutional ...
To enhance interpretation ability, we employ an equivalence of angular distillation loss to guide our lightweight network and we apply recursive knowledge.
