nms_gpu throws runtime error: illegal memory access
Trying to train this model on my own dataset.
I converted it to Pascal VOC format, made sure the maximum image dimension is 1000 (most images are 600x900), and adjusted some fine details, but I get the following error while training:
Called with args:
Namespace(batch_size=8, checkepoch=1, checkpoint=0, checkpoint_interval=10000, checksession=1, class_agnostic=False, cuda=True, dataset='my_custom_ds', disp_interval=100, large_scale=False, lr=0.004, lr_decay_gamma=0.1, lr_decay_step=8, mGPUs=True, max_epochs=2, net='res101', num_workers=2, optimizer='sgd', resume=False, save_dir='saved_models', session=1, start_epoch=1, use_tfboard=False)
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
'ANCHOR_SCALES': [8, 16, 32],
'CROP_RESIZE_WITH_MAX_POOL': False,
'CUDA': False,
'DATA_DIR': '/home/cyb/user/pycharm/src/faster-rcnn.pytorch/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'res101',
'FEAT_STRIDE': [16],
'GPU_ID': 0,
'MATLAB': 'matlab',
'MAX_NUM_GT_BOXES': 93,
'MOBILENET': {'DEPTH_MULTIPLIER': 1.0,
'FIXED_LAYERS': 5,
'REGU_DEPTH': False,
'WEIGHT_DECAY': 4e-05},
'PIXEL_MEANS': array([[[ 102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'align',
'POOLING_SIZE': 7,
'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
'RNG_SEED': 3,
'ROOT_DIR': '/home/cyb/user/pycharm/src/faster-rcnn.pytorch',
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'MODE': 'nms',
'NMS': 0.3,
'PROPOSAL_METHOD': 'gt',
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'RPN_TOP_N': 5000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': False,
'BATCH_SIZE': 128,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'BIAS_DECAY': False,
'BN_TRAIN': False,
'DISPLAY': 20,
'DOUBLE_BIAS': False,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'GAMMA': 0.1,
'HAS_RPN': True,
'IMS_PER_BATCH': 1,
'LEARNING_RATE': 0.001,
'MAX_SIZE': 1000,
'MOMENTUM': 0.9,
'PROPOSAL_METHOD': 'gt',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 8,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_ITERS': 5000,
'SNAPSHOT_KEPT': 3,
'SNAPSHOT_PREFIX': 'res101_faster_rcnn',
'STEPSIZE': [30000],
'SUMMARY_INTERVAL': 180,
'TRIM_HEIGHT': 600,
'TRIM_WIDTH': 600,
'TRUNCATED': False,
'USE_ALL_GT': True,
'USE_FLIPPED': True,
'USE_GT': False,
'WEIGHT_DECAY': 0.0001},
'USE_GPU_NMS': True}
Loaded dataset `voc_2007_trainval` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
wrote gt roidb to /home/cyb/user/pycharm/src/faster-rcnn.pytorch/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
before filtering, there are 4224 images...
after filtering, there are 4224 images...
4224 roidb entries
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py:68: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
rpn_cls_prob_reshape = F.softmax(rpn_cls_score_reshape)
/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py:98: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
cls_prob = F.softmax(cls_score)
[session 1][epoch 1][iter 0] loss: 233749.3594, lr: 4.00e-03
fg/bg=(24/1000), time cost: 6.419112
rpn_cls: 179158.4219, rpn_box: 41295.5859, rcnn_cls: 9535.8477, rcnn_box 3759.5171
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCTensorMath.cu line=267 error=77 : an illegal memory access was encountered
an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered, at line 147
CUDA Error: an illegal memory access was encountered, at line 154
an illegal memory access was encountered
an illegal memory access was encountered
Traceback (most recent call last):
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/trainval_net.py", line 326, in <module>
rois_label = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
raise output
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
output = module(*input, **kwargs)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/faster_rcnn/faster_rcnn.py", line 50, in forward
rois, rpn_loss_cls, rpn_loss_bbox = self.RCNN_rpn(base_feat, im_info, gt_boxes, num_boxes)
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/rpn.py", line 78, in forward
im_info, cfg_key))
File "/home/cyb/user/.conda/envs/my_proj/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/rpn/proposal_layer.py", line 148, in forward
keep_idx_i = nms(torch.cat((proposals_single, scores_single), 1), nms_thresh)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/nms/nms_wrapper.py", line 18, in nms
return nms_gpu(dets, thresh)
File "/home/cyb/user/pycharm/src/faster-rcnn.pytorch/lib/model/nms/nms_gpu.py", line 11, in nms_gpu
keep = keep[:num_out[0]]
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/generic/THCStorage.c:36
Process finished with exit code 1
I have two Titan K40 cards; however, this is an illegal memory access and not an out-of-memory error, so I wonder where it comes from.
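In case it helps anyone debugging the same crash: CUDA errors are reported asynchronously, so the Python traceback can point at the wrong line; rerunning with CUDA_LAUNCH_BLOCKING=1 python trainval_net.py ... usually localizes the failing kernel. With custom VOC-style datasets, a common trigger is a malformed annotation (a box with xmax <= xmin, negative coordinates, or coordinates outside the image), which can feed degenerate proposals into the GPU NMS kernel. Below is a minimal sanity check, only a sketch; the annotation directory path is a placeholder and needs to be adjusted to your layout.

# Hypothetical sanity check for a VOC-style dataset (path is a placeholder).
# Flags boxes that are degenerate or fall outside the reported image size --
# a common source of downstream CUDA "illegal memory access" errors.
import glob
import os
import xml.etree.ElementTree as ET

ANNOT_DIR = "data/VOCdevkit2007/VOC2007/Annotations"  # adjust to your layout

for xml_path in sorted(glob.glob(os.path.join(ANNOT_DIR, "*.xml"))):
    root = ET.parse(xml_path).getroot()
    width = int(root.find("size/width").text)
    height = int(root.find("size/height").text)
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        x1, y1 = float(box.find("xmin").text), float(box.find("ymin").text)
        x2, y2 = float(box.find("xmax").text), float(box.find("ymax").text)
        if x2 <= x1 or y2 <= y1:
            print(f"{xml_path}: degenerate box {(x1, y1, x2, y2)}")
        if x1 < 0 or y1 < 0 or x2 > width or y2 > height:
            print(f"{xml_path}: box {(x1, y1, x2, y2)} outside {width}x{height}")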
Issue Analytics
- Created 6 years ago
- Comments: 19 (5 by maintainers)
Top GitHub Comments
The NaN was fixed when I set MAX_NUM_GT_BOXES to the correct value. @jwyang
However, the network seems to learn small vehicles better than large vehicles, and for some reason it cannot learn the solar panel class (which is quite small) at all.
Any idea how to improve this, or why it fails on the solar panel? It also throws some errors during the AP calculation.
Edit: I just found out why it learns small vehicles better: they appear far more often than large vehicles. And apparently I mistakenly filtered the solar panels out of my train set, which is why their AP is 0.
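For anyone tuning the same settings: MAX_NUM_GT_BOXES needs to be at least the largest number of objects in any single training image, and per-class instance counts make it obvious when a class has accidentally been filtered out of the split (as happened with the solar panels here). A rough sketch for deriving both from the VOC annotations, again with a placeholder path:

# Hypothetical helper to pick MAX_NUM_GT_BOXES and spot missing classes
# (same placeholder VOC layout as the sketch above).
import glob
import os
import xml.etree.ElementTree as ET
from collections import Counter

ANNOT_DIR = "data/VOCdevkit2007/VOC2007/Annotations"  # adjust to your layout

max_boxes = 0
class_counts = Counter()
for xml_path in glob.glob(os.path.join(ANNOT_DIR, "*.xml")):
    objects = ET.parse(xml_path).getroot().findall("object")
    max_boxes = max(max_boxes, len(objects))
    class_counts.update(obj.find("name").text for obj in objects)

print("suggested MAX_NUM_GT_BOXES:", max_boxes)
print("instances per class:", dict(class_counts))  # a missing class means it was filtered out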