'loss: nan' error while training with standard yolo_loss
Hi David,
I just want to report a glitch in my experiments… I am training models (my own dataset = 27,000 annotations, 1 class) with the following cmd line:
python3 train.py --model_type yolo3_mobilenetv2_lite --annotation_file train.txt --val_annotation_file valid.txt --classes_path configs/my_yolo_class.txt --anchors_path=configs/yolo3_anchors.txt --save_eval_checkpoint --batch_size 16 --eval_online --eval_epoch_interval 3 --transfer_epoch 2 --freeze_level 1 --total_epoch 20
This is just an example; I tried half a dozen combinations of backbones and heads… Out of 10 trials, I only managed to reach epoch=20 twice. In the other cases, at some point (usually around epoch 4 to 9) I get a crash with this typical message:
705/1106 [==================>...........] - ETA: 7:54 - loss: 9.8939 - location_loss: 3.5176 - confidence_loss: 4.8495 - class_loss: 0.0014
Batch 705: Invalid loss, terminating training
706/1106 [==================>...........] - ETA: 7:52 - loss: nan - location_loss: nan - confidence_loss: nan - class_loss: nan
Traceback (most recent call last):
  File "train.py", line 252, in <module>
For the record, I work on Ubuntu 18.04 with TF 2.1, and I pulled the latest commits from your repo.
So I switched to ‘use_diou_loss=True’ and so far all is fine, with much better convergence than before. This looks to be a very helpful addition!
Gilles
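For context on what the DIoU loss adds over the standard localization loss: it penalizes the normalized distance between the predicted and ground-truth box centers, so the gradient stays informative even when boxes barely overlap. Below is a minimal sketch of the idea, written as a hypothetical standalone helper for boxes in (x_center, y_center, w, h) format; it is not the repo's own yolo3/loss.py code.

import tensorflow as tf

def diou_loss(b_true, b_pred):
    # Boxes as (x_center, y_center, w, h); shapes broadcastable to (..., 4).
    t_min = b_true[..., :2] - b_true[..., 2:] / 2.0
    t_max = b_true[..., :2] + b_true[..., 2:] / 2.0
    p_min = b_pred[..., :2] - b_pred[..., 2:] / 2.0
    p_max = b_pred[..., :2] + b_pred[..., 2:] / 2.0

    # IoU term
    inter_wh = tf.maximum(tf.minimum(t_max, p_max) - tf.maximum(t_min, p_min), 0.0)
    inter = inter_wh[..., 0] * inter_wh[..., 1]
    union = b_true[..., 2] * b_true[..., 3] + b_pred[..., 2] * b_pred[..., 3] - inter
    iou = inter / (union + 1e-7)

    # Distance penalty: squared center distance over squared enclosing-box diagonal
    center_dist = tf.reduce_sum(tf.square(b_true[..., :2] - b_pred[..., :2]), axis=-1)
    enclose_wh = tf.maximum(t_max, p_max) - tf.minimum(t_min, p_min)
    diag = tf.reduce_sum(tf.square(enclose_wh), axis=-1)

    return 1.0 - iou + center_dist / (diag + 1e-7)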
Top GitHub Comments
Hi @farhodbekshamsiyev
It has been a while, and I haven’t tried the newest YOLOv4 version, but the model type which worked best for me (1-class underwater object recognition) was clearly yolo3_spp. And since I needed speed and compactness, I found mobilenetv2_lite very effective. Out of 7 different combinations of backbones and YOLO versions, this choice was the clear winner (for my application). Since YOLOv4 also uses the SPP feature, I suppose it must be as good or better. I intend to train the combo yolo4 + mobilenetv3 very soon. Anyway, this is what I did (please note that the batch size of 16 is required because my GPU has only 8 GB of memory). In what follows, my_yolo_class.txt has only 1 class, and my images are all 416x416x3.
Precautions, before launching a training command line and to avoid crashes (see the sketch right after this list):
1- I switched to diou_loss (and nms_diou) by setting “use_diou_loss” to True in yolo3/loss.py, around line 230 (?)
2- In common/utils.py, around line 20 (?), I changed memory_limit to 7000
3- I disabled Mixed Precision Training at the beginning of train.py
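A rough sketch of what precautions 2 and 3 typically correspond to with the TF 2.1-era API (illustrative only; the actual code in common/utils.py and train.py may differ):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Cap GPU memory (in MB) so an 8 GB card keeps headroom for CUDA/cuDNN workspaces
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=7000)])

# Disabling mixed precision simply means NOT applying a float16 policy, i.e.
# leaving out (or commenting out) lines such as:
# policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
# tf.keras.mixed_precision.experimental.set_policy(policy)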
Command line: $ python3 train.py --model_type yolo3_mobilenetv2_lite_spp --annotation_file train.txt --val_annotation_file valid.txt --classes_path configs/my_yolo_class.txt --anchors_path=configs/yolo3_anchors.txt --batch_size 16 --transfer_epoch 4 --freeze_level 1 --total_epoch 40 --optimizer rmsprop --decay_type cosine
Also note that I switched from adam to rmsprop. I am not sure which of these changes to the standard training really helped, but since it worked for me, I am happy! With 27,000 images in my dataset, I found that after 40 epochs the total loss did not change anymore, so I stopped training at that point. Running the model on my test set gave pretty good results, so I suppose I did it right.
I hope it helps…
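For anyone wiring the same optimizer/decay combination up by hand, here is a minimal sketch of rmsprop with a cosine learning-rate schedule in tf.keras. It is illustrative only: the repo configures this through its --optimizer and --decay_type flags, and the initial learning rate below is an assumption, not a value taken from the repo.

import tensorflow as tf

steps_per_epoch = 27000 // 16              # dataset size / batch size from above
total_steps = steps_per_epoch * 40         # 40 epochs

lr_schedule = tf.keras.experimental.CosineDecay(
    initial_learning_rate=1e-3,            # assumed value, not taken from the repo
    decay_steps=total_steps)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr_schedule)
# model.compile(optimizer=optimizer, loss=..., metrics=...)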
I am having the same error almost one and a half years later, @david8862. Is there an exact reason for this kind of gradient explosion? I think this nan comes from exploding gradients, doesn’t it? It would be great if you could share your experience on it. Actually I am using YOLOv1.
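For reference, the “Invalid loss, terminating training” message in the log above is printed by Keras’ TerminateOnNaN callback, and a common first mitigation for exploding gradients is gradient clipping on the optimizer. A generic sketch for a tf.keras training setup (not this repo’s code; the clipnorm value is an assumption):

import tensorflow as tf

# Clip the per-variable gradient norm so a single bad batch cannot blow up the weights
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)  # clipnorm value is illustrative

# Stop training cleanly as soon as the loss becomes nan/inf
callbacks = [tf.keras.callbacks.TerminateOnNaN()]

# model.compile(optimizer=optimizer, loss=..., metrics=...)
# model.fit(train_data, validation_data=val_data, callbacks=callbacks)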