Getting NaN loss while training

I have a dataset of 846 images, but when I start training, the log reports 1692 images. The dataset is in PASCAL_VOC format, and the JPEGImages folder contains 846 images. During training I get loss: nan. Can you please tell me the reason?

Preparing training data...
done
before filtering, there are 1692 images...
after filtering, there are 1692 images...
1692 roidb entries
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
[session 1][epoch 1][iter 0] loss: 6.7142, lr: 1.00e-03
    fg/bg=(2/126), time cost: 238.602555
    rpn_cls: 0.7190, rpn_box: 1.7119, rcnn_cls: 4.2830, rcnn_box 0.0003
[session 1][epoch 1][iter 100] loss: nan, lr: 1.00e-03
    fg/bg=(13/115), time cost: 40.301977
    rpn_cls: 0.5280, rpn_box: nan, rcnn_cls: 0.7082, rcnn_box 0.0000
[session 1][epoch 1][iter 200] loss: nan, lr: 1.00e-03
    fg/bg=(32/96), time cost: 40.584164
    rpn_cls: 0.3966, rpn_box: nan, rcnn_cls: 1.0526, rcnn_box 0.0000
[session 1][epoch 1][iter 300] loss: nan, lr: 1.00e-03
    fg/bg=(8/120), time cost: 41.294393
    rpn_cls: 0.4398, rpn_box: nan, rcnn_cls: 0.6331, rcnn_box 0.0000
[session 1][epoch 1][iter 400] loss: nan, lr: 1.00e-03
    fg/bg=(32/96), time cost: 42.057193
    rpn_cls: 0.2161, rpn_box: nan, rcnn_cls: 0.9535, rcnn_box 0.0000
[session 1][epoch 1][iter 500] loss: nan, lr: 1.00e-03
    fg/bg=(32/96), time cost: 41.014715
    rpn_cls: 0.1673, rpn_box: nan, rcnn_cls: 0.9406, rcnn_box 0.0000
[session 1][epoch 1][iter 600] loss: nan, lr: 1.00e-03
    fg/bg=(32/96), time cost: 42.453671
    rpn_cls: 0.1687, rpn_box: nan, rcnn_cls: 0.9308, rcnn_box 0.0000

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 10

Top GitHub Comments

25 reactions
ashutoshIITK commented, May 21, 2018

@super-wcg Yes, I solved the NaN loss problem. It was caused by errors in the bounding-box coordinates. Each of the following produced NaN loss:
1. Coordinates outside the image resolution
2. xmin == xmax
3. ymin == ymax
4. A very small bounding box

For the 4th case, we required |xmax - xmin| >= 20 and, similarly, |ymax - ymin| >= 20.

After fixing all of this, I trained the model for 20 epochs and did not get the NaN loss error.

Thank you.
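The four checks listed in this comment can be sketched as a small pre-training validator. This is only an illustration, not code from the repository: the function names `box_problems` and `check_annotation` are hypothetical, the 20-pixel minimum is the threshold quoted in the comment, and treating 0 and the image width/height as the valid bounds is an assumption that may need adjusting for 1-based annotations.

```python
import xml.etree.ElementTree as ET

MIN_SIDE = 20  # threshold quoted in the comment; tune for your data


def box_problems(xmin, ymin, xmax, ymax, width, height, min_side=MIN_SIDE):
    """Return the reasons (possibly none) a box may be unsafe to train on."""
    problems = []
    # Assumed bounds convention: 0 <= x <= width and 0 <= y <= height.
    if xmin < 0 or ymin < 0 or xmax > width or ymax > height:
        problems.append("outside image")
    if xmin == xmax:
        problems.append("xmin == xmax")
    if ymin == ymax:
        problems.append("ymin == ymax")
    if abs(xmax - xmin) < min_side or abs(ymax - ymin) < min_side:
        problems.append("too small")
    return problems


def check_annotation(xml_text):
    """Check every <object> in one PASCAL VOC annotation given as a string."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = float(size.find("width").text)
    height = float(size.find("height").text)
    report = []
    for obj in root.iter("object"):
        bbox = obj.find("bndbox")
        coords = [float(bbox.find(tag).text)
                  for tag in ("xmin", "ymin", "xmax", "ymax")]
        problems = box_problems(*coords, width, height)
        if problems:
            report.append((obj.find("name").text, problems))
    return report
```

Running `check_annotation` over every file in the Annotations folder before training gives a list of object names and the checks they fail, so the offending boxes can be fixed or dropped.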

11 reactions
swchui commented, May 5, 2018

There is something wrong with your dataset.
1. In “\lib\dataset\pascal_voc.py”, delete the “- 1”, so that
   x1 = float(bbox.find('xmin').text) - 1
   y1 = float(bbox.find('ymin').text) - 1
   becomes
   x1 = float(bbox.find('xmin').text)
   y1 = float(bbox.find('ymin').text)
2. Then run “rm -rf $your data cache$”.
Maybe a log(-1) leads to this error.
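A minimal sketch of why the “- 1” can matter, assuming an annotation whose top-left corner is already at pixel 0. The helper `shift_top_left` is hypothetical and only mimics the two lines quoted in this comment; it is not the loader itself.

```python
def shift_top_left(xmin_text, ymin_text, subtract_one=True):
    """Parse the top-left corner the way the quoted loader lines do."""
    offset = 1 if subtract_one else 0
    x1 = float(xmin_text) - offset
    y1 = float(ymin_text) - offset
    return x1, y1


# An annotation that already starts at pixel 0 (hypothetical values).
with_shift = shift_top_left("0", "0")
without_shift = shift_top_left("0", "0", subtract_one=False)

print(with_shift)     # (-1.0, -1.0)
print(without_shift)  # (0.0, 0.0)
```

With the shift, a zero-based corner becomes negative, which is the kind of out-of-range coordinate the comment suspects. As step 2 of the comment says, the data cache has to be deleted after the change.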

Top Results From Across the Web

Deep-Learning Nan loss reasons - python - Stack Overflow
You may have an issue with the input data. Try calling assert not np.any(np.isnan(x)) on the input data to make sure you are...

Common Causes of NANs During Training
Common Causes of NANs During Training · Gradient blow up · Bad learning rate policy and params · Faulty Loss function · Faulty...

Getting NaN for loss - General Discussion - TensorFlow Forum
You transform X_train but pass X_train_A and X_train_B into the model, which were never transformed by the scaler and contain negative values.

Debugging a Machine Learning model written in TensorFlow ...
I wrote up a convnet model borrowing liberally from the training loop of the ... NaN loss. Now, when I ran it though,...

Keras Sequential model returns loss 'nan'
@lcrmorin I'm pretty sure that my dataset doesn't contain nan elements. However, I notice that the loss turn to nan when I changed...
