NaN losses during training!

I'm following the training instructions exactly, but when I train with the command ./experiments/scripts/train_faster_rcnn.sh 0 pascal_voc vgg16, all losses become NaN at around iteration 3440. This is the full output:

+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ GPU_ID=0
+ DATASET=pascal_voc
+ NET=vgg16
+ array=($@)
+ len=3
+ EXTRA_ARGS=
+ EXTRA_ARGS_SLUG=
+ case ${DATASET} in
+ TRAIN_IMDB=voc_2007_trainval
+ TEST_IMDB=voc_2007_test
+ STEPSIZE=50000
+ ITERS=70000
+ ANCHORS='[8,16,32]'
+ RATIOS='[0.5,1,2]'
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
+ exec
++ tee -a experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
+ echo Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
+ set +x
+ '[' '!' -f output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt.index ']'
+ [[ ! -z '' ]]
+ CUDA_VISIBLE_DEVICES=0
+ time python ./tools/trainval_net.py --weight data/imagenet_weights/vgg16.ckpt --imdb voc_2007_trainval --imdbval voc_2007_test --iters 70000 --cfg experiments/cfgs/vgg16.yml --net vgg16 --set ANCHOR_SCALES '[8,16,32]' ANCHOR_RATIOS '[0.5,1,2]' TRAIN.STEPSIZE 50000
Called with args:
Namespace(cfg_file='experiments/cfgs/vgg16.yml', imdb_name='voc_2007_trainval', imdbval_name='voc_2007_test', max_iters=70000, net='vgg16', set_cfgs=['ANCHOR_SCALES', '[8,16,32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'TRAIN.STEPSIZE', '50000'], tag=None, weight='data/imagenet_weights/vgg16.ckpt')
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
 'ANCHOR_SCALES': [8, 16, 32],
 'DATA_DIR': '/home/amirhf/Projects/tf-faster-rcnn/data',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'vgg16',
 'GPU_ID': 0,
 'MATLAB': 'matlab',
 'PIXEL_MEANS': array([[[ 102.9801,  115.9465,  122.7717]]]),
 'POOLING_MODE': 'crop',
 'POOLING_SIZE': 7,
 'RESNET': {'BN_TRAIN': False, 'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'RNG_SEED': 3,
 'ROOT_DIR': '/home/amirhf/Projects/tf-faster-rcnn',
 'TEST': {'BBOX_REG': True,
          'HAS_RPN': True,
          'MAX_SIZE': 1000,
          'MODE': 'nms',
          'NMS': 0.3,
          'PROPOSAL_METHOD': 'gt',
          'RPN_NMS_THRESH': 0.7,
          'RPN_POST_NMS_TOP_N': 300,
          'RPN_PRE_NMS_TOP_N': 6000,
          'RPN_TOP_N': 5000,
          'SCALES': [600],
          'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False,
           'BATCH_SIZE': 256,
           'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
           'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
           'BBOX_NORMALIZE_TARGETS': True,
           'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
           'BBOX_REG': True,
           'BBOX_THRESH': 0.5,
           'BG_THRESH_HI': 0.5,
           'BG_THRESH_LO': 0.0,
           'BIAS_DECAY': False,
           'DISPLAY': 20,
           'DOUBLE_BIAS': True,
           'FG_FRACTION': 0.25,
           'FG_THRESH': 0.5,
           'GAMMA': 0.1,
           'HAS_RPN': True,
           'IMS_PER_BATCH': 1,
           'LEARNING_RATE': 0.001,
           'MAX_SIZE': 1000,
           'MOMENTUM': 0.9,
           'PROPOSAL_METHOD': 'gt',
           'RPN_BATCHSIZE': 256,
           'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'RPN_CLOBBER_POSITIVES': False,
           'RPN_FG_FRACTION': 0.5,
           'RPN_NEGATIVE_OVERLAP': 0.3,
           'RPN_NMS_THRESH': 0.7,
           'RPN_POSITIVE_OVERLAP': 0.7,
           'RPN_POSITIVE_WEIGHT': -1.0,
           'RPN_POST_NMS_TOP_N': 2000,
           'RPN_PRE_NMS_TOP_N': 12000,
           'SCALES': [600],
           'SNAPSHOT_ITERS': 5000,
           'SNAPSHOT_KEPT': 3,
           'SNAPSHOT_PREFIX': 'vgg16_faster_rcnn',
           'STEPSIZE': 50000,
           'SUMMARY_INTERVAL': 180,
           'TRUNCATED': False,
           'USE_ALL_GT': True,
           'USE_FLIPPED': True,
           'USE_GT': False,
           'WEIGHT_DECAY': 0.0005},
 'USE_GPU_NMS': True}
Loaded dataset `voc_2007_trainval` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
wrote gt roidb to /home/amirhf/Projects/tf-faster-rcnn/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
10022 roidb entries
Output will be saved to `/home/amirhf/Projects/tf-faster-rcnn/output/vgg16/voc_2007_trainval/default`
TensorFlow summaries will be saved to `/home/amirhf/Projects/tf-faster-rcnn/tensorboard/vgg16/voc_2007_trainval/default`
Loaded dataset `voc_2007_test` for training
Set proposal method: gt
Preparing training data...
wrote gt roidb to /home/amirhf/Projects/tf-faster-rcnn/data/cache/voc_2007_test_gt_roidb.pkl
done
4952 validation roidb entries
Filtered 0 roidb entries: 10022 -> 10022
Filtered 0 roidb entries: 4952 -> 4952
2017-05-11 18:12:37.107319: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.107338: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.107344: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.107350: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.404484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties: 
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.291
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.27GiB
2017-05-11 18:12:37.404517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 
2017-05-11 18:12:37.404523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y 
2017-05-11 18:12:37.404537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
Solving...
/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Loading initial model weights from data/imagenet_weights/vgg16.ckpt
Varibles restored: vgg_16/conv1/conv1_1/biases:0
Varibles restored: vgg_16/conv1/conv1_2/weights:0
Varibles restored: vgg_16/conv1/conv1_2/biases:0
Varibles restored: vgg_16/conv2/conv2_1/weights:0
Varibles restored: vgg_16/conv2/conv2_1/biases:0
Varibles restored: vgg_16/conv2/conv2_2/weights:0
Varibles restored: vgg_16/conv2/conv2_2/biases:0
Varibles restored: vgg_16/conv3/conv3_1/weights:0
Varibles restored: vgg_16/conv3/conv3_1/biases:0
Varibles restored: vgg_16/conv3/conv3_2/weights:0
Varibles restored: vgg_16/conv3/conv3_2/biases:0
Varibles restored: vgg_16/conv3/conv3_3/weights:0
Varibles restored: vgg_16/conv3/conv3_3/biases:0
Varibles restored: vgg_16/conv4/conv4_1/weights:0
Varibles restored: vgg_16/conv4/conv4_1/biases:0
Varibles restored: vgg_16/conv4/conv4_2/weights:0
Varibles restored: vgg_16/conv4/conv4_2/biases:0
Varibles restored: vgg_16/conv4/conv4_3/weights:0
Varibles restored: vgg_16/conv4/conv4_3/biases:0
Varibles restored: vgg_16/conv5/conv5_1/weights:0
Varibles restored: vgg_16/conv5/conv5_1/biases:0
Varibles restored: vgg_16/conv5/conv5_2/weights:0
Varibles restored: vgg_16/conv5/conv5_2/biases:0
Varibles restored: vgg_16/conv5/conv5_3/weights:0
Varibles restored: vgg_16/conv5/conv5_3/biases:0
Varibles restored: vgg_16/fc6/biases:0
Varibles restored: vgg_16/fc7/biases:0
Loaded.
Fix VGG16 layers..
iter: 20 / 70000, total loss: 1.780578
 >>> rpn_loss_cls: 0.331266
 >>> rpn_loss_box: 0.058807
 >>> loss_cls: 0.851354
 >>> loss_box: 0.539151
 >>> lr: 0.001000
speed: 0.908s / iter
iter: 40 / 70000, total loss: 0.701749
 >>> rpn_loss_cls: 0.551406
 >>> rpn_loss_box: 0.128653
 >>> loss_cls: 0.021690
 >>> loss_box: 0.000000
 >>> lr: 0.001000
.
.  [REMOVED LINES TO MAKE THE POST SHORTER]
.
.
iter: 3380 / 70000, total loss: 0.616202
 >>> rpn_loss_cls: 0.100265
 >>> rpn_loss_box: 0.145635
 >>> loss_cls: 0.185931
 >>> loss_box: 0.184371
 >>> lr: 0.001000
speed: 0.433s / iter
iter: 3400 / 70000, total loss: 1.312786
 >>> rpn_loss_cls: 0.295694
 >>> rpn_loss_box: 0.017820
 >>> loss_cls: 0.452280
 >>> loss_box: 0.546992
 >>> lr: 0.001000
speed: 0.432s / iter
iter: 3420 / 70000, total loss: 0.642559
 >>> rpn_loss_cls: 0.132440
 >>> rpn_loss_box: 0.039820
 >>> loss_cls: 0.293447
 >>> loss_box: 0.176852
 >>> lr: 0.001000
speed: 0.431s / iter
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:56: RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:58: RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:60: RuntimeWarning: invalid value encountered in add
  pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:62: RuntimeWarning: invalid value encountered in add
  pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h
iter: 3440 / 70000, total loss: nan
 >>> rpn_loss_cls: nan
 >>> rpn_loss_box: nan
 >>> loss_cls: nan
 >>> loss_box: nan
 >>> lr: 0.001000

Warnings like this one appear first:

RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w

and from that point on, all losses are NaN. I have not changed anything in the files!

Issue Analytics

Created 6 years ago
Comments: 26 (10 by maintainers)

Top GitHub Comments

4 reactions

guojiapeng00 commented, Apr 30, 2018

I had this error too, and today I fixed it. I found that many boxes extend outside my images. For example, my images are 600*600, but there is a box (550,550,650,650). When I removed those images from trainval.txt, training worked.
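The check described in this comment can be sketched in a few lines of Python. This is only an illustration, not code from the repository: the function name find_out_of_bounds_boxes is hypothetical, and it assumes boxes are given as (xmin, ymin, xmax, ymax) with coordinates allowed to run from 0 up to the image width and height. If your annotations are 1-based, adjust the bounds accordingly.

```python
def find_out_of_bounds_boxes(boxes, width, height):
    """Return the boxes that do not fit inside a width x height image.

    Each box is (xmin, ymin, xmax, ymax) in pixel coordinates.
    Hypothetical helper; the 0..width / 0..height convention is an assumption.
    """
    bad = []
    for box in boxes:
        xmin, ymin, xmax, ymax = box
        # A usable box has positive area and lies fully inside the image.
        inside = 0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height
        if not inside:
            bad.append(box)
    return bad


# The example from the comment: a 600x600 image with one box past the edge.
boxes = [(10, 20, 300, 400), (550, 550, 650, 650)]
bad_boxes = find_out_of_bounds_boxes(boxes, width=600, height=600)
print(bad_boxes)
```

Any image that yields a non-empty list is a candidate for removal from trainval.txt, or for having its annotation corrected.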

0 reactions

nassarofficial commented, Jul 8, 2019

Hello, did you fix it? I hit the same error and tried all day without success. If you know why it happens, please tell me; I would really appreciate it!

I had this error, and the only fix was to clean up my XML annotation files: some were empty, and some bounding boxes had negative values. After I eliminated them, the error disappeared.
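A rough way to find such annotation files is to parse each one and report empty files, missing objects, and negative coordinates. The sketch below assumes the standard Pascal VOC layout (object, bndbox, xmin, ymin, xmax, ymax); the function name check_voc_annotation is hypothetical and is not part of tf-faster-rcnn.

```python
import xml.etree.ElementTree as ET


def check_voc_annotation(xml_text):
    """Return a list of human-readable problems found in one annotation.

    Hypothetical helper; assumes Pascal VOC style tags.
    """
    if not xml_text.strip():
        return ["file is empty"]
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return ["not valid XML: %s" % err]

    objects = root.findall("object")
    if not objects:
        return ["no <object> entries"]

    problems = []
    for index, obj in enumerate(objects):
        bndbox = obj.find("bndbox")
        if bndbox is None:
            problems.append("object %d has no <bndbox>" % index)
            continue
        for tag in ("xmin", "ymin", "xmax", "ymax"):
            text = bndbox.findtext(tag)
            try:
                value = float(text)
            except (TypeError, ValueError):
                # Tag is missing (None) or does not contain a number.
                problems.append("object %d has no usable <%s>" % (index, tag))
                continue
            if value < 0:
                problems.append("object %d has negative %s: %s" % (index, tag, text))
    return problems


sample = """<annotation>
  <object><name>dog</name>
    <bndbox><xmin>-5</xmin><ymin>10</ymin><xmax>100</xmax><ymax>120</ymax></bndbox>
  </object>
</annotation>"""
print(check_voc_annotation(sample))
print(check_voc_annotation(""))
```

To scan a whole dataset, read each XML file in your annotations directory, pass its contents to the function, and fix or exclude every file that returns a non-empty list.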