
'loss: nan' error while training with standard yolo_loss

See original GitHub issue

Hi David,

I just want to report a glitch in my experiments… I am training models (my own dataset = 27,000 annotations, 1 class) with the following cmd line:

python3 train.py --model_type yolo3_mobilenetv2_lite --annotation_file train.txt --val_annotation_file valid.txt --classes_path configs/my_yolo_class.txt --anchors_path=configs/yolo3_anchors.txt --save_eval_checkpoint --batch_size 16 --eval_online --eval_epoch_interval 3 --transfer_epoch 2 --freeze_level 1 --total_epoch 20

This is just one example; I tried half a dozen combinations of backbones and heads… Out of 10 trials, I only managed to reach epoch 20 twice. In the other cases, at some point (usually around epoch 4 to 9), I get a crash with this typical message:

705/1106 [==================>...........] - ETA: 7:54 - loss: 9.8939 - location_loss: 3.5176 - confidence_loss: 4.8495 - class_loss: 0.0014
Batch 705: Invalid loss, terminating training

706/1106 [==================>...........] - ETA: 7:52 - loss: nan - location_loss: nan - confidence_loss: nan - class_loss: nan
Traceback (most recent call last):
  File "train.py", line 252, in <module>
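
As an aside, "Invalid loss, terminating training" is the wording printed by tf.keras.callbacks.TerminateOnNaN, which aborts training as soon as a batch loss turns NaN or Inf. A minimal sketch of wiring such a guard into a generic tf.keras run (the toy model and data below are placeholders, not this repo's train.py):

    # Minimal sketch of a NaN guard in a generic tf.keras run (toy model/data,
    # not this repo's train.py).
    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")

    x = tf.random.normal((64, 4))
    y = tf.random.normal((64, 1))

    # TerminateOnNaN stops fit() on the first batch whose loss is NaN/Inf and
    # prints "Batch <n>: Invalid loss, terminating training".
    model.fit(x, y, batch_size=16, epochs=1,
              callbacks=[tf.keras.callbacks.TerminateOnNaN()])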

For the record, I work on Ubuntu 18.04 with TF 2.1, and I pulled the latest commits from your repo.

So I switched to ‘use_diou_loss=True’ and so far all is fine, with much better convergence than before. This looks to be a very helpful addition!
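
For readers who haven't met DIoU before: it is plain IoU minus a penalty for the distance between box centers, normalized by the diagonal of the smallest enclosing box. A generic sketch of that term, with illustrative names, not the implementation in yolo3/loss.py:

    # Generic DIoU sketch (illustrative; not the repo's yolo3/loss.py).
    # Boxes are (x_center, y_center, width, height), shape (..., 4).
    import tensorflow as tf

    def diou(b1, b2, eps=1e-7):
        b1_xy, b1_wh = b1[..., :2], b1[..., 2:4]
        b2_xy, b2_wh = b2[..., :2], b2[..., 2:4]
        b1_min, b1_max = b1_xy - b1_wh / 2.0, b1_xy + b1_wh / 2.0
        b2_min, b2_max = b2_xy - b2_wh / 2.0, b2_xy + b2_wh / 2.0

        # Plain IoU.
        inter_wh = tf.maximum(tf.minimum(b1_max, b2_max) - tf.maximum(b1_min, b2_min), 0.0)
        inter = inter_wh[..., 0] * inter_wh[..., 1]
        union = b1_wh[..., 0] * b1_wh[..., 1] + b2_wh[..., 0] * b2_wh[..., 1] - inter
        iou = inter / (union + eps)

        # Penalty: squared center distance over squared enclosing-box diagonal.
        center_dist = tf.reduce_sum(tf.square(b1_xy - b2_xy), axis=-1)
        enclose_wh = tf.maximum(b1_max, b2_max) - tf.minimum(b1_min, b2_min)
        diag = tf.reduce_sum(tf.square(enclose_wh), axis=-1)
        return iou - center_dist / (diag + eps)

    # The location loss is then roughly (1 - diou) per positive anchor, which is
    # bounded and tends to be numerically tamer than raw x/y/w/h regression.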

Gilles

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 21 (6 by maintainers)

Top GitHub Comments

1 reaction
gillmac13 commented, May 22, 2020

Hi @farhodbekshamsiyev

It has been a while, and I haven’t tried the newest YOLOv4 version, but the model type that worked best for me (1-class underwater object recognition) was clearly yolo3_spp. And since I needed speed and compactness, I found mobilenetv2_lite very effective. Out of 7 different combinations of backbones and YOLO versions, this choice is the clear winner (for my application). Since YOLOv4 also uses the SPP feature, I suppose it must be as good or better. I intend to train the yolo4 + mobilenetv3 combo very soon.

Anyway, this is what I did (please note that the batch size of 16 is required because my GPU has 8 GB of memory). In what follows, my_yolo_class.txt has only 1 class, and my images are all 416x416x3.

Precautions, before launching a training command line and to avoid crashes:

1. I switched to diou_loss (and nms_diou) by setting “use_diou_loss” to true in /yolo3/loss.py, around line 230 (?).
2. In /common/utils.py, around line 20 (?), I changed memory_limit to 7000 (see the sketch below).
3. I disabled Mixed Precision Training at the beginning of train.py.
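
On point 2: in TF 2.1, a per-GPU cap like memory_limit = 7000 is usually applied through a virtual device configuration. A sketch under that assumption, which is not necessarily what /common/utils.py actually does:

    # Sketch: capping the first GPU at ~7000 MB in TF 2.1. This is an assumption
    # about what the memory_limit setting controls, not the repo's utils.py code.
    import tensorflow as tf

    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=7000)])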

Command line: $ python3 train.py --model_type yolo3_mobilenetv2_lite_spp --annotation_file train.txt --val_annotation_file valid.txt --classes_path configs/my_yolo_class.txt --anchors_path=configs/yolo3_anchors.txt --batch_size 16 --transfer_epoch 4 --freeze_level 1 --total_epoch 40 --optimizer rmsprop --decay_type cosine

Also note that I switched from adam to rmsprop. I am not sure which of the changes from the standard training setup really helped, but since it worked for me, I am happy! With 27,000 images in my dataset, I found that after 40 epochs the total loss did not change anymore, so I stopped training at that point. Running the model on my test set gave pretty good results, so I suppose I did it right.
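
For context, the --optimizer rmsprop --decay_type cosine flags map, conceptually, to something like the following in tf.keras. This is a generic sketch with placeholder learning-rate values, not the repo's actual optimizer wiring:

    # Generic sketch of RMSprop with a cosine learning-rate decay in tf.keras
    # (placeholder numbers; not the repo's optimizer/decay code).
    import tensorflow as tf

    steps_per_epoch = 1106     # matches the 705/1106 progress bar in the log above
    total_epochs = 40
    lr_schedule = tf.keras.experimental.CosineDecay(   # tf.keras.optimizers.schedules.CosineDecay in newer TF
        initial_learning_rate=1e-3,                    # placeholder value
        decay_steps=steps_per_epoch * total_epochs)

    optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr_schedule)
    # model.compile(optimizer=optimizer, loss=...)     # plugged into the usual compile step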

I hope it helps…

0 reactions
yakhyo commented, Jun 15, 2021

I am having the same error almost one and a half years later, @david8862. Is there an exact reason for this kind of gradient explosion? I think this NaN comes from exploding gradients, doesn’t it? It would be great if you could share your experience with it. Actually, I am using YOLOv1.
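
One common mitigation for NaNs caused by exploding gradients, independent of this particular repo, is to clip gradient norms on the optimizer. A minimal sketch:

    # Generic sketch: gradient clipping as a guard against exploding-gradient NaNs
    # (not specific to this repo's training code).
    import tensorflow as tf

    # clipnorm caps the norm of each gradient before the update is applied.
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

    # In a custom training loop, the global-norm variant looks like:
    # grads, _ = tf.clip_by_global_norm(grads, 1.0)
    # optimizer.apply_gradients(zip(grads, model.trainable_variables))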

Read more comments on GitHub >

Top Results From Across the Web

Loss equals nan in training neural network with yolo custom ...
I'm using transfer learning using the MobileNetV2 architecture. P.S. - Loss goes to NAN when training the custom YOLO model As in this,...
Read more >
Common Causes of NANs During Training
Common Causes of NANs During Training · Gradient blow up · Bad learning rate policy and params · Faulty Loss function · Faulty...
Read more >
TensorFlow 2 YOLOv3 Mnist detection training tutorial
TensorFlow 2 YOLOv3 Mnist detection training tutorial. In this tutorial, I'll cover the Yolo v3 loss function and model training.
Read more >
Custom Loss Function Gradient Vector nan - PyTorch Forums
And the simplified yolo loss as the following: ... During training, the first loss is correctly calculated, but after the first gradient ...
Read more >
I don't understand why I am getting NaN loss scores. Can ...
22K subscribers in the neuralnetworks community. Subreddit about Artificial Neural Networks, Deep Learning and Machine Learning.
Read more >
