NaN losses during training on a new dataset
Loaded.
Fix VGG16 layers..
Fixed.
iter: 20 / 70000, total loss: 2.107100
>>> rpn_loss_cls: 0.689572
>>> rpn_loss_box: 0.538556
>>> loss_cls: 0.878972
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.860s / iter
iter: 40 / 70000, total loss: 1.400519
>>> rpn_loss_cls: 0.672595
>>> rpn_loss_box: 0.627981
>>> loss_cls: 0.099943
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.791s / iter
iter: 60 / 70000, total loss: 1.890739
>>> rpn_loss_cls: 0.664076
>>> rpn_loss_box: 1.136271
>>> loss_cls: 0.090392
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.707s / iter
iter: 80 / 70000, total loss: 0.884432
>>> rpn_loss_cls: 0.619964
>>> rpn_loss_box: 0.239456
>>> loss_cls: 0.025013
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.669s / iter
iter: 100 / 70000, total loss: 1.159740
>>> rpn_loss_cls: 0.625335
>>> rpn_loss_box: 0.505168
>>> loss_cls: 0.029237
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.658s / iter
/home/gabbar/ML/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:28: RuntimeWarning: invalid value encountered in log
targets_dw = np.log(gt_widths / ex_widths)
iter: 120 / 70000, total loss: nan
>>> rpn_loss_cls: 0.692661
>>> rpn_loss_box: nan
>>> loss_cls: 2.072727
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.686s / iter
iter: 140 / 70000, total loss: nan
>>> rpn_loss_cls: 0.692553
>>> rpn_loss_box: nan
>>> loss_cls: 2.069435
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.700s / iter
iter: 160 / 70000, total loss: nan
>>> rpn_loss_cls: 0.692714
>>> rpn_loss_box: nan
>>> loss_cls: 2.065971
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.717s / iter
iter: 180 / 70000, total loss: nan
>>> rpn_loss_cls: 0.692636
>>> rpn_loss_box: nan
>>> loss_cls: 2.062488
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.728s / iter
iter: 200 / 70000, total loss: nan
>>> rpn_loss_cls: 0.692119
>>> rpn_loss_box: nan
>>> loss_cls: 2.059007
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.737s / iter
iter: 220 / 70000, total loss: nan
>>> rpn_loss_cls: 0.691994
>>> rpn_loss_box: nan
>>> loss_cls: 2.055528
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.748s / iter
iter: 240 / 70000, total loss: nan
>>> rpn_loss_cls: 0.692261
>>> rpn_loss_box: nan
>>> loss_cls: 2.052053
>>> loss_box: 0.000000
>>> lr: 0.000010
speed: 0.752s / iter
2017-06-15 12:11:33.044274: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
[[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
[the same HistogramSummary warning repeats 20 more times over the next ~60 ms; identical message and node, only the timestamps differ]
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 136, in <module>
    max_iters=args.max_iters)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/model/train_val.py", line 336, in train_net
    sw.train_model(sess, max_iters)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/model/train_val.py", line 225, in train_model
    self.net.train_step_with_summary(sess, blobs, train_op)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/nets/network.py", line 395, in train_step_with_summary
    feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 982, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1032, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1052, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
  [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
  [[Node: vgg_16/anchor/PyFunc/_319 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1164_vgg_16/anchor/PyFunc", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'TRAIN/vgg_16/conv3/conv3_1/weights', defined at:
  File "./tools/trainval_net.py", line 136, in <module>
    max_iters=args.max_iters)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/model/train_val.py", line 336, in train_net
    sw.train_model(sess, max_iters)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/model/train_val.py", line 105, in train_model
    anchor_ratios=cfg.ANCHOR_RATIOS)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/nets/network.py", line 333, in create_architecture
    self._add_train_summary(var)
  File "/home/gabbar/ML/tf-faster-rcnn/tools/../lib/nets/network.py", line 72, in _add_train_summary
    tf.summary.histogram('TRAIN/' + var.op.name, var)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 209, in histogram
    tag=scope.rstrip('/'), values=values, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 139, in _histogram_summary
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Nan in summary histogram for: TRAIN/vgg_16/conv3/conv3_1/weights
  [[Node: TRAIN/vgg_16/conv3/conv3_1/weights = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv3/conv3_1/weights/tag, vgg_16/conv3/conv3_1/weights/read/_287)]]
  [[Node: vgg_16/anchor/PyFunc/_319 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1164_vgg_16/anchor/PyFunc", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Command exited with non-zero status 1
275.06user 24.81system 5:00.02elapsed 99%CPU (0avgtext+0avgdata 3047940maxresident)k
216inputs+390712outputs (1major+680645minor)pagefaults 0swaps
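The RuntimeWarning printed just before iteration 120 is the giveaway: np.log(gt_widths / ex_widths) in bbox_transform.py received a non-positive argument, meaning some ground-truth box has zero or negative width (or height), and the resulting nan then propagates through rpn_loss_box into the shared conv weights, where the histogram summary finally aborts the run. Below is a minimal pre-training sanity check; it is a sketch, not code from the repo, and assumes the tf-faster-rcnn roidb convention of 'boxes' rows as [x1, y1, x2, y2] with width measured as x2 - x1 + 1:

import numpy as np

def find_degenerate_boxes(roidb):
    """Return (entry_index, box_index) pairs whose width or height is <= 0.

    np.log(gt_width / anchor_width) in bbox_transform.py is NaN for such
    boxes, which is exactly the RuntimeWarning seen in the log above.
    """
    bad = []
    for i, entry in enumerate(roidb):
        boxes = entry['boxes'].astype(np.float64)
        widths = boxes[:, 2] - boxes[:, 0] + 1.0
        heights = boxes[:, 3] - boxes[:, 1] + 1.0
        for j in np.where((widths <= 0) | (heights <= 0))[0]:
            bad.append((i, int(j)))
    return bad

# Example: the second box has x2 < x1, i.e. negative width.
roidb = [{'boxes': np.array([[10, 10, 50, 50], [30, 10, 28, 50]])}]
print(find_degenerate_boxes(roidb))  # -> [(0, 1)]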
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@atorefrank, @R3v0lut10nist, @endernewton, @zdm123: I get the same problem training on my data, where rpn_loss_box is nan. After some research, it's because in the file pascal_voc.py the function _load_pascal_annotation makes pixel indexes 0-based. The code is:

x1 = float(bbox.find('xmin').text) - 1
y1 = float(bbox.find('ymin').text) - 1
x2 = float(bbox.find('xmax').text) - 1
y2 = float(bbox.find('ymax').text) - 1

But if your data is not 1-based (mine is 0-based), this produces -1 coordinates in the data. You can try deleting the "- 1" operation. Hope that helps!
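To make that suggestion concrete, here is a small self-contained sketch. parse_box is a hypothetical helper that mirrors what _load_pascal_annotation does per <object> element, and the zero_based_annotations flag is illustrative, not part of the repo:

import xml.etree.ElementTree as ET

def parse_box(obj, zero_based_annotations):
    """Parse one <object> bounding box, optionally skipping the VOC offset.

    The stock code subtracts 1 to turn 1-based VOC pixel indexes into
    0-based ones; that subtraction underflows to -1 when the annotations
    are already 0-based.
    """
    bbox = obj.find('bndbox')
    off = 0.0 if zero_based_annotations else 1.0
    x1 = float(bbox.find('xmin').text) - off
    y1 = float(bbox.find('ymin').text) - off
    x2 = float(bbox.find('xmax').text) - off
    y2 = float(bbox.find('ymax').text) - off
    return x1, y1, x2, y2

obj = ET.fromstring(
    '<object><bndbox><xmin>0</xmin><ymin>4</ymin>'
    '<xmax>33</xmax><ymax>47</ymax></bndbox></object>')
print(parse_box(obj, zero_based_annotations=True))   # (0.0, 4.0, 33.0, 47.0)
print(parse_box(obj, zero_based_annotations=False))  # (-1.0, 3.0, 32.0, 46.0) <- the bug

Note that a -1 coordinate may only cause trouble some iterations into training, for instance after the flipped-image augmentation remaps box coordinates, which matches the log above failing at iteration 120 rather than at load time.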
Oh, I should mention that with a new dataset you may need to tune the learning rate a bit.
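For reference, the learning rate in this repo is a config key (TRAIN.LEARNING_RATE in lib/model/config.py). A minimal sketch of lowering it programmatically, assuming the py-faster-rcnn-style cfg_from_list helper that this codebase inherits (check your checkout; the value below is only a starting point):

# Assumes lib/ is on PYTHONPATH, as tools/trainval_net.py arranges.
from model.config import cfg, cfg_from_list

# Same effect as passing `--set TRAIN.LEARNING_RATE 0.00001` on the command line.
cfg_from_list(['TRAIN.LEARNING_RATE', '0.00001'])
print(cfg.TRAIN.LEARNING_RATE)  # 1e-05, matching the lr shown in the log above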