Nan losses during training on new dataset

> Loaded.
Fix VGG16 layers..
Fixed.
iter: 20 / 70000, total loss: 2.107100
 >>> rpn_loss_cls: 0.689572
 >>> rpn_loss_box: 0.538556
 >>> loss_cls: 0.878972
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.860s / iter
iter: 40 / 70000, total loss: 1.400519
 >>> rpn_loss_cls: 0.672595
 >>> rpn_loss_box: 0.627981
 >>> loss_cls: 0.099943
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.791s / iter
iter: 60 / 70000, total loss: 1.890739
 >>> rpn_loss_cls: 0.664076
 >>> rpn_loss_box: 1.136271
 >>> loss_cls: 0.090392
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.707s / iter
iter: 80 / 70000, total loss: 0.884432
 >>> rpn_loss_cls: 0.619964
 >>> rpn_loss_box: 0.239456
 >>> loss_cls: 0.025013
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.669s / iter
iter: 100 / 70000, total loss: 1.159740
 >>> rpn_loss_cls: 0.625335
 >>> rpn_loss_box: 0.505168
 >>> loss_cls: 0.029237
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.658s / iter
/home/gabbar/ML/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:28: RuntimeWarning: invalid value encountered in log
  targets_dw = np.log(gt_widths / ex_widths)
iter: 120 / 70000, total loss: nan
 >>> rpn_loss_cls: 0.692661
 >>> rpn_loss_box: nan
 >>> loss_cls: 2.072727
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.686s / iter
iter: 140 / 70000, total loss: nan
 >>> rpn_loss_cls: 0.692553
 >>> rpn_loss_box: nan
 >>> loss_cls: 2.069435
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.700s / iter
iter: 160 / 70000, total loss: nan
 >>> rpn_loss_cls: 0.692714
 >>> rpn_loss_box: nan
 >>> loss_cls: 2.065971
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.717s / iter
iter: 180 / 70000, total loss: nan
 >>> rpn_loss_cls: 0.692636
 >>> rpn_loss_box: nan
 >>> loss_cls: 2.062488
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.728s / iter
iter: 200 / 70000, total loss: nan
 >>> rpn_loss_cls: 0.692119
 >>> rpn_loss_box: nan
 >>> loss_cls: 2.059007
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.737s / iter
iter: 220 / 70000, total loss: nan
 >>> rpn_loss_cls: 0.691994
 >>> rpn_loss_box: nan
 >>> loss_cls: 2.055528
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.748s / iter
iter: 240 / 70000, total loss: nan
 >>> rpn_loss_cls: 0.692261
 >>> rpn_loss_box: nan
 >>> loss_cls: 2.052053
 >>> loss_box: 0.000000
 >>> lr: 0.000010
speed: 0.752s / iter
2017-06-15 12:11:33.044274: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.045363: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.045636: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.045901: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.047963: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.049066: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.049813: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.049915: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.049894: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.050219: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.050255: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.050281: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.050306: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.050333: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.050356: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.050380: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.050405: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.050431: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.050456: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.050476: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
2017-06-15 12:11:33.103636: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 136, in <module>
    max_iters=args.max_iters)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/model/train_val.py", line 336, in train_net
    sw.train_model(sess, max_iters)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/model/train_val.py", line 225, in train_model
    self.net.train_step_with_summary(sess, blobs, train_op)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/nets/network.py", line 395, in train_step_with_summary
    feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 982, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1032, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1052, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
	 [[Node: vgg_16/anchor/PyFunc/_319 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1164_vgg_16/anchor/PyFunc", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op u'TRAIN/vgg_16/conv3/conv3_1/weights', defined at:
  File "./tools/trainval_net.py", line 136, in <module>
    max_iters=args.max_iters)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/model/train_val.py", line 336, in train_net
    sw.train_model(sess, max_iters)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/model/train_val.py", line 105, in train_model
    anchor_ratios=cfg.ANCHOR_RATIOS)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/nets/network.py", line 333, in create_architecture
    self._add_train_summary(var)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/nets/network.py", line 72, in _add_train_summary
    tf.summary.histogram('TRAIN/' + var.op.name, var)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 209, in histogram
    tag=scope.rstrip('/'), values=values, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 139, in _histogram_summary
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
	 [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
	 [[Node: vgg_16/anchor/PyFunc/_319 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1164_vgg_16/anchor/PyFunc", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Command exited with non-zero status 1
275.06user 24.81system 5:00.02elapsed 99%CPU (0avgtext+0avgdata 3047940maxresident)k
216inputs+390712outputs (1major+680645minor)pagefaults 0swaps
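
The RuntimeWarning in the log points at the expression targets_dw = np.log(gt_widths / ex_widths). As a minimal sketch of how that expression can yield nan, the snippet below feeds it a box whose width is negative. The helper name width_log_targets and the sample widths are made up for illustration; only the log expression itself comes from the log above.

```python
import numpy as np


def width_log_targets(ex_widths, gt_widths):
    """Evaluate log(gt_widths / ex_widths), the expression named in the warning.

    Hypothetical helper for illustration, not the repository's function.
    """
    # Silence the RuntimeWarning here so the nan can be inspected directly.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.log(gt_widths / ex_widths)


# Made-up widths: the second ground-truth box has a negative width,
# which is what a corrupted coordinate can produce.
ex_widths = np.array([16.0, 16.0, 16.0])
gt_widths = np.array([32.0, -4.0, 16.0])

targets_dw = width_log_targets(ex_widths, gt_widths)

# Flag every target that is nan or infinite.
bad = ~np.isfinite(targets_dw)
print(targets_dw)
print("non-finite targets at indices:", np.flatnonzero(bad).tolist())
```

A single nan target is enough to turn the regression loss into nan, and once that reaches the gradients the weights become nan as well, which matches the histogram error at the end of the log.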

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 24 (3 by maintainers)

Top GitHub Comments

9 reactions
lonlonago commented, Oct 23, 2017

@atorefrank, @R3v0lut10nist, @endernewton, @zdm123: I hit the same problem when training on my own data: rpn_loss_box becomes nan. After some research, I found the cause. In the file 'pascal_voc.py', the function '_load_pascal_annotation' converts pixel indexes to 0-based with this code:

    x1 = float(bbox.find('xmin').text) - 1
    y1 = float(bbox.find('ymin').text) - 1
    x2 = float(bbox.find('xmax').text) - 1
    y2 = float(bbox.find('ymax').text) - 1

If your annotations are not 1-based (mine are 0-based), the subtraction puts -1 into the box coordinates. You can try deleting the "- 1" from each line. Hope this helps!
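
Before changing the loader, it may help to confirm that your annotations really are 0-based. The sketch below is one way to check, under assumptions: it parses a made-up Pascal VOC style XML snippet, applies the same "- 1" shift described in the comment above, and lists boxes that end up with negative coordinates or non-positive size. The names SAMPLE_XML, load_boxes and find_invalid are hypothetical and are not part of tf-faster-rcnn.

```python
import xml.etree.ElementTree as ET

# Made-up annotation for illustration; the first box starts at pixel 0,
# which only happens when coordinates are 0-based.
SAMPLE_XML = """
<annotation>
  <object><name>cat</name>
    <bndbox><xmin>0</xmin><ymin>0</ymin><xmax>49</xmax><ymax>39</ymax></bndbox>
  </object>
  <object><name>dog</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>60</xmax><ymax>80</ymax></bndbox>
  </object>
</annotation>
"""


def load_boxes(xml_text, one_based):
    """Return (x1, y1, x2, y2) tuples, subtracting 1 only if one_based is True."""
    offset = 1.0 if one_based else 0.0
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        boxes.append(tuple(
            float(bbox.find(tag).text) - offset
            for tag in ("xmin", "ymin", "xmax", "ymax")
        ))
    return boxes


def find_invalid(boxes):
    """Return indices of boxes with a negative corner or non-positive size."""
    return [
        i for i, (x1, y1, x2, y2) in enumerate(boxes)
        if x1 < 0 or y1 < 0 or x2 <= x1 or y2 <= y1
    ]


shifted = load_boxes(SAMPLE_XML, one_based=True)     # applies the "- 1"
unshifted = load_boxes(SAMPLE_XML, one_based=False)  # leaves coordinates as-is

print("invalid after shift:", find_invalid(shifted))
print("invalid without shift:", find_invalid(unshifted))
```

If the shifted version reports invalid boxes and the unshifted one does not, the data is most likely 0-based and removing the "- 1" is a reasonable fix to try.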

1 reaction
endernewton commented, Jul 21, 2017

Oh, I should mention that with a new dataset you may need to tune the learning rate (lr) a bit.
