nms_gpu throws runtime error: illegal memory access
Trying to train this model on my own dataset.
I converted it to Pascal VOC format, made sure the maximum image dimension is 1000 (most images are 600x900), and adjusted some fine details, but I get the following error while training:
Called with args:
Namespace(batch_size=8, checkepoch=1, checkpoint=0, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='my_custom_ds', disp_interval=100, large_scale=False, lr=0.004, lr_decay_gamma=0.1, lr_decay_step=8, mGPUs=True, max_epochs=2, net='res101', num_workers=2, optimizer='sgd', resume=False, save_dir='saved_models', session=1, start_epoch=1, use_tfboard=False)
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8, 16, 32],
'CROP_RESIZE_WITH_MAX_POOL': False,
'CUDA': False,
'DATA_DIR': '/home/cyb/user/pycharm/src/faster-rcnn.pytorch/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'res101',
'FEAT_STRIDE': [16],
'GPU_ID': 0,
'MATLAB': 'matlab',
'MAX_NUM_GT_BOXES': 93,
'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
'FIXED_LAYERS': 5,
'REGU_DEPTH': False,
'WEIGHT_DECAY': 4e-05},
'PIXEL_MEANS': array([[[ 102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'align',
'POOLING_SIZE': 7,
'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
'RNG_SEED': 3,
'ROOT_DIR': '/home/cyb/user/pycharm/src/faster-rcnn.pytorch',
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'MODE': 'nms',
'NMS': 0.3,
'PROPOSAL_METHOD': 'gt',
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'RPN_TOP_N': 5000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': False,
'BATCH_SIZE': 128,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'BIAS_DECAY': False,
'BN_TRAIN': False,
'DISPLAY': 20,
'DOUBLE_BIAS': False,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'GAMMA': 0.1,
'HAS_RPN': True,
'IMS_PER_BATCH': 1,
'LEARNING_RATE': 0.001,
'MAX_SIZE': 1000,
'MOMENTUM': 0.9,
'PROPOSAL_METHOD': 'gt',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 8,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_ITERS': 5000,
'SNAPSHOT_KEPT': 3,
'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
'STEPSIZE': [30000],
'SUMMARY_INTERVAL': 180,
'TRIM_HEIGHT': 600,
'TRIM_WIDTH': 600,
'TRUNCATED': False,
'USE_ALL_GT': True,
'USE_FLIPPED': True,
'USE_GT': False,
'WEIGHT_DECAY': 0.0001},
'USE_GPU_NMS': True}
Loaded dataset `voc_2007_trainval` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
wrote gt roidb to /home/cyb/user/pycharm/src/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
before filtering, there are 4224 images...
after filtering, there are 4224 images...
4224 roidb entries
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py:68: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape)
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py:98: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
cls_prob = F.softmax(cls_score)
[session 1][epoch 1][iter 0] loss: 233749.3594, lr: 4.00e-03
fg/bg=(24/1000), time cost: 6.419112
rpn_cls: 179158.4219, rpn_box: 41295.5859, rcnn_cls: 9535.8477, rcnn_box 3759.5171
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCTensorMath.cu line=267 error=77 : an illegal memory access was encountered
an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered, at line 147
CUDA Error: an illegal memory access was encountered, at line 154
an illegal memory access was encountered
an illegal memory access was encountered
Traceback (most recent call last):
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/trainval_net.py", line 326, in <module>
rois_label = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
raise output
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
output = module(*input, **kwargs)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py", line 50, in forward
rois, rpn_loss_cls, rpn_loss_bbox = self.RCNN_rpn(base_feat, im_info, gt_boxes, num_boxes)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py", line 78, in forward
im_info, cfg_key))
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/proposal_layer.py", line 148, in forward
keep_idx_i = nms(torch.cat((proposals_single, scores_single), 1), nms_thresh)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/nms/nms_wrapper.py", line 18, in nms
return nms_gpu(dets, thresh)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/nms/nms_gpu.py", line 11, in nms_gpu
keep = keep[:num_out[0]]
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCStorage.c:36
Process finished with exit code 1
I have two Titan K40 cards; however, this is an illegal memory access and not an out-of-memory error, so I wonder where it comes from.
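In case it helps anyone debugging the same crash: CUDA errors are reported asynchronously, so the Python traceback can point at the wrong line; rerunning with CUDA_LAUNCH_BLOCKING=1 python trainval_net.py ... usually localizes the failing kernel. With custom VOC-style datasets, a common trigger is a malformed annotation (a box with xmax <= xmin, negative coordinates, or coordinates outside the image), which can feed degenerate proposals into the GPU NMS kernel. Below is a minimal sanity check, only a sketch; the annotation directory path is a placeholder and needs to be adjusted to your layout.

# Hypothetical sanity check for a VOC-style dataset (path is a placeholder).
# Flags boxes that are degenerate or fall outside the reported image size --
# a common source of downstream CUDA "illegal memory access" errors.
import glob
import os
import xml.etree.ElementTree as ET

ANNOT_DIR = "data/VOCdevkit2007/VOC2007/Annotations"  # adjust to your layout

for xml_path in sorted(glob.glob(os.path.join(ANNOT_DIR, "*.xml"))):
    root = ET.parse(xml_path).getroot()
    width = int(root.find("size/width").text)
    height = int(root.find("size/height").text)
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        x1, y1 = float(box.find("xmin").text), float(box.find("ymin").text)
        x2, y2 = float(box.find("xmax").text), float(box.find("ymax").text)
        if x2 <= x1 or y2 <= y1:
            print(f"{xml_path}: degenerate box {(x1, y1, x2, y2)}")
        if x1 < 0 or y1 < 0 or x2 > width or y2 > height:
            print(f"{xml_path}: box {(x1, y1, x2, y2)} outside {width}x{height}")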
Issue Analytics
- Created 6 years ago
- Comments: 19 (5 by maintainers)
Top GitHub Comments
The NaN was fixed when I set MAX_NUM_GT_BOXES to the correct value. @jwyang
However, the network seems to learn small vehicles better than large vehicles, and for some reason it cannot learn the solar panel class (which is quite small) at all.
Any idea how to improve this, or why it fails on the solar panel? It also throws some errors during the AP calculation.
Edit: I just found out why it learns small vehicles better: they appear far more often than large vehicles. And apparently I mistakenly filtered the solar panels out of my train set, which is why their AP is 0.
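For anyone tuning the same settings: MAX_NUM_GT_BOXES needs to be at least the largest number of objects in any single training image, and per-class instance counts make it obvious when a class has accidentally been filtered out of the split (as happened with the solar panels here). A rough sketch for deriving both from the VOC annotations, again with a placeholder path:

# Hypothetical helper to pick MAX_NUM_GT_BOXES and spot missing classes
# (same placeholder VOC layout as the sketch above).
import glob
import os
import xml.etree.ElementTree as ET
from collections import Counter

ANNOT_DIR = "data/VOCdevkit2007/VOC2007/Annotations"  # adjust to your layout

max_boxes = 0
class_counts = Counter()
for xml_path in glob.glob(os.path.join(ANNOT_DIR, "*.xml")):
    objects = ET.parse(xml_path).getroot().findall("object")
    max_boxes = max(max_boxes, len(objects))
    class_counts.update(obj.find("name").text for obj in objects)

print("suggested MAX_NUM_GT_BOXES:", max_boxes)
print("instances per class:", dict(class_counts))  # a missing class means it was filtered out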