
nms_gpu throws runtime error: illegal memory access

See original GitHub issue

I am trying to train this model on my own dataset.
I converted it to Pascal VOC format, made sure the maximum resolution is 1000 (most images are 600x900), and adjusted a few fine details, but I get the following error while training:

Called with args:
Namespace(batch_size=8, checkepoch=1, checkpoint=0, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='my_custom_ds', disp_interval=100, large_scale=False, lr=0.004, lr_decay_gamma=0.1, lr_decay_step=8, mGPUs=True, max_epochs=2, net='res101', num_workers=2, optimizer='sgd', resume=False, save_dir='saved_models', session=1, start_epoch=1, use_tfboard=False)
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
 'ANCHOR_SCALES': [8, 16, 32],
 'CROP_RESIZE_WITH_MAX_POOL': False,
 'CUDA': False,
 'DATA_DIR': '/home/cyb/user/pycharm/src/faster-rcnn.pytorch/data',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'res101',
 'FEAT_STRIDE': [16],
 'GPU_ID': 0,
 'MATLAB': 'matlab',
 'MAX_NUM_GT_BOXES': 93,
 'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
               'FIXED_LAYERS': 5,
               'REGU_DEPTH': False,
               'WEIGHT_DECAY': 4e-05},
 'PIXEL_MEANS': array([[[ 102.9801,  115.9465,  122.7717]]]),
 'POOLING_MODE': 'align',
 'POOLING_SIZE': 7,
 'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'RNG_SEED': 3,
 'ROOT_DIR': '/home/cyb/user/pycharm/src/faster-rcnn.pytorch',
 'TEST': {'BBOX_REG': True,
          'HAS_RPN': True,
          'MAX_SIZE': 1000,
          'MODE': 'nms',
          'NMS': 0.3,
          'PROPOSAL_METHOD': 'gt',
          'RPN_MIN_SIZE': 16,
          'RPN_NMS_THRESH': 0.7,
          'RPN_POST_NMS_TOP_N': 300,
          'RPN_PRE_NMS_TOP_N': 6000,
          'RPN_TOP_N': 5000,
          'SCALES': [600],
          'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False,
           'BATCH_SIZE': 128,
           'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
           'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
           'BBOX_NORMALIZE_TARGETS': True,
           'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
           'BBOX_REG': True,
           'BBOX_THRESH': 0.5,
           'BG_THRESH_HI': 0.5,
           'BG_THRESH_LO': 0.0,
           'BIAS_DECAY': False,
           'BN_TRAIN': False,
           'DISPLAY': 20,
           'DOUBLE_BIAS': False,
           'FG_FRACTION': 0.25,
           'FG_THRESH': 0.5,
           'GAMMA': 0.1,
           'HAS_RPN': True,
           'IMS_PER_BATCH': 1,
           'LEARNING_RATE': 0.001,
           'MAX_SIZE': 1000,
           'MOMENTUM': 0.9,
           'PROPOSAL_METHOD': 'gt',
           'RPN_BATCHSIZE': 256,
           'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'RPN_CLOBBER_POSITIVES': False,
           'RPN_FG_FRACTION': 0.5,
           'RPN_MIN_SIZE': 8,
           'RPN_NEGATIVE_OVERLAP': 0.3,
           'RPN_NMS_THRESH': 0.7,
           'RPN_POSITIVE_OVERLAP': 0.7,
           'RPN_POSITIVE_WEIGHT': -1.0,
           'RPN_POST_NMS_TOP_N': 2000,
           'RPN_PRE_NMS_TOP_N': 12000,
           'SCALES': [600],
           'SNAPSHOT_ITERS': 5000,
           'SNAPSHOT_KEPT': 3,
           'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
           'STEPSIZE': [30000],
           'SUMMARY_INTERVAL': 180,
           'TRIM_HEIGHT': 600,
           'TRIM_WIDTH': 600,
           'TRUNCATED': False,
           'USE_ALL_GT': True,
           'USE_FLIPPED': True,
           'USE_GT': False,
           'WEIGHT_DECAY': 0.0001},
 'USE_GPU_NMS': True}
Loaded dataset `voc_2007_trainval` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
wrote gt roidb to /home/cyb/user/pycharm/src/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
before filtering, there are 4224 images...
after filtering, there are 4224 images...
4224 roidb entries
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py:68: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape)
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py:98: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  cls_prob = F.softmax(cls_score)
[session 1][epoch  1][iter    0] loss: 233749.3594, lr: 4.00e-03
			fg/bg=(24/1000), time cost: 6.419112
			rpn_cls: 179158.4219, rpn_box: 41295.5859, rcnn_cls: 9535.8477, rcnn_box 3759.5171
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCTensorMath.cu line=267 error=77 : an illegal memory access was encountered
an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered, at line 147
CUDA Error: an illegal memory access was encountered, at line 154
an illegal memory access was encountered
an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/trainval_net.py", line 326, in <module>
    rois_label = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
    output = module(*input, **kwargs)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py", line 50, in forward
    rois, rpn_loss_cls, rpn_loss_bbox = self.RCNN_rpn(base_feat, im_info, gt_boxes, num_boxes)
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py", line 78, in forward
    im_info, cfg_key))
  File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/proposal_layer.py", line 148, in forward
    keep_idx_i = nms(torch.cat((proposals_single, scores_single), 1), nms_thresh)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/nms/nms_wrapper.py", line 18, in nms
    return nms_gpu(dets, thresh)
  File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/nms/nms_gpu.py", line 11, in nms_gpu
    keep = keep[:num_out[0]]
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCStorage.c:36

Process finished with exit code 1

I have two Titan K40 cards; however, this is an illegal memory access and not an out-of-memory error, so I wonder where it comes from.
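
(The following debugging step is not from the original thread, just a generic technique.) Because CUDA launches are asynchronous, error 77 is often reported by whatever THC call happens to run next, not by the kernel that actually faulted. A minimal sketch of forcing synchronous launches and checkpointing around the suspect NMS call:

# Hypothetical debugging sketch, not part of faster-rcnn.pytorch.
# CUDA_LAUNCH_BLOCKING must be set before the first CUDA call, so set it
# before importing torch (or export it in the shell instead).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def sync_checkpoint(label):
    # Force any pending kernels to finish here; if one of them performed an
    # illegal access, the exception surfaces at this line instead of at a
    # later, unrelated copy or allocation.
    torch.cuda.synchronize()
    print("reached:", label)

# Usage idea: call sync_checkpoint("before nms") / sync_checkpoint("after nms")
# around the nms(...) call in proposal_layer.py to confirm whether the fault
# really happens inside nms_gpu or earlier in the RPN forward pass.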

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 19 (5 by maintainers)

Top GitHub Comments

2 reactions
CodeJjang commented, Mar 24, 2018

The NaN was fixed when I set the MAX_NUM_GT_BOXES to the correct value.
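
A quick way to get that value right (my own sketch, not from this comment) is to scan the Pascal-VOC-style annotations and take the maximum number of <object> entries in any single image; the annotation path below is an assumption about the local layout:

# Hypothetical helper: compute the correct MAX_NUM_GT_BOXES for a VOC-style dataset.
import glob
import os
import xml.etree.ElementTree as ET

def max_gt_boxes(annotation_dir):
    # Largest number of ground-truth objects found in any single image.
    max_boxes = 0
    for xml_path in glob.glob(os.path.join(annotation_dir, "*.xml")):
        max_boxes = max(max_boxes, len(ET.parse(xml_path).findall("object")))
    return max_boxes

print(max_gt_boxes("data/VOCdevkit2007/VOC2007/Annotations"))  # assumed path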

1 reaction
CodeJjang commented, Mar 23, 2018

@jwyang

[session 1][epoch 14][iter  500] loss: 0.0917, lr: 4.00e-04
			fg/bg=(96/416), time cost: 168.633036
			rpn_cls: 0.0002, rpn_box: 0.0006, rcnn_cls: 0.0357, rcnn_box 0.0256
[session 1][epoch 14][iter  600] loss: 0.0865, lr: 4.00e-04
			fg/bg=(90/422), time cost: 168.349854
			rpn_cls: 0.0015, rpn_box: 0.0009, rcnn_cls: 0.0078, rcnn_box 0.0114
[session 1][epoch 14][iter  700] loss: 0.0794, lr: 4.00e-04
			fg/bg=(118/394), time cost: 168.221013
			rpn_cls: 0.0028, rpn_box: 0.0021, rcnn_cls: 0.0474, rcnn_box 0.0575

However, the network seems to learn the small vehicle class better than the large vehicle class, and it also cannot learn the solar panel class (which is quite small) for some reason:

AP for large vehicle = 0.7531
AP for small vehicle = 0.8962
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/datasets/voc_eval.py:204: RuntimeWarning: invalid value encountered in true_divide
  rec = tp / float(npos)
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/datasets/voc_eval.py:45: RuntimeWarning: invalid value encountered in greater_equal
  if np.sum(rec >= t) == 0:
AP for solar panel = 0.0000
Mean AP = 0.5498
~~~~~~~~
Results:
0.753
0.896
0.000
0.550
~~~~~~~~

Any idea how to improve, or why it fails on the solar panel class? It also throws some warnings during the AP calculation.

Edit: I just found out why it learns small vehicles better: they appear far more often than large vehicles. Apparently I also mistakenly filtered the solar panels out of my training set, which is why its AP is 0.
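
For anyone hitting the same thing, a small sanity check (mine, not from the thread) that would have caught both problems is to count ground-truth boxes per class over the training annotations: a heavily skewed count explains the small/large vehicle gap, and a zero count flags a class that was filtered out entirely. The path is again an assumption:

# Hypothetical helper: per-class ground-truth box counts for a VOC-style dataset.
import glob
import os
import xml.etree.ElementTree as ET
from collections import Counter

def class_counts(annotation_dir):
    # Count how many ground-truth boxes each class contributes.
    counts = Counter()
    for xml_path in glob.glob(os.path.join(annotation_dir, "*.xml")):
        for obj in ET.parse(xml_path).findall("object"):
            counts[obj.find("name").text.strip()] += 1
    return counts

for cls, n in class_counts("data/VOCdevkit2007/VOC2007/Annotations").most_common():
    print(cls, n)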
