RuntimeError: CUDA error: an illegal memory access was encountered
@CoinCheung, hello! I want to train on my own dataset with two classes, 0 and 1. I changed the code in bisenetv2.py, cityscapes_cv2.py, train.py, and base_dataset.py as shown below. But when I run
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 tools/train.py --model bisenetv2
I get the error below. Can you give me some advice? Thank you!
The error:
Traceback (most recent call last):
  File "tools/train.py", line 232, in <module>
    main()
  File "tools/train.py", line 226, in main
    train()
  File "tools/train.py", line 175, in train
    loss_pre = criteria_pre(logits, lb)
  File "/home/liyanzhou/anaconda3/envs/py36_torch16/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./lib/ohem_ce_loss.py", line 43, in forward
    loss_hard = loss[loss > self.thresh]
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
The code I changed:
bisenetv2.py:
cfg = dict(
    model_type='bisenetv2',
    num_aux_heads=4,
    lr_start=5e-2,
    weight_decay=5e-4,
    warmup_iters=1000,
    max_iter=150000,  # iter 150000
    im_root='./datasets/dianchipian',  # dianchipian # cityscapes
    train_im_anns='./datasets/dianchipian/train.txt',
    val_im_anns='./datasets/dianchipian/val.txt',
    scales=[0.25, 2.],
    cropsize=[512, 1024],
    ims_per_gpu=2,  # batch_size 8
    use_fp16=True,
    use_sync_bn=False,
    respth='./res',
)
cityscapes_cv2.py:
class CityScapes(BaseDataset):
    '''
    '''
    def __init__(self, dataroot, annpath, trans_func=None, mode='train'):
        super(CityScapes, self).__init__(
                dataroot, annpath, trans_func, mode)
        self.n_cats = 2  # 19
        self.lb_ignore = 255
        self.lb_map = np.arange(256).astype(np.uint8)
        #for el in labels_info:
        #    self.lb_map[el['id']] = el['trainId']
        self.to_tensor = T.ToTensor(
            mean=(0.3257, 0.3690, 0.3223),  # city, rgb
            std=(0.2112, 0.2148, 0.2115),
        )
base_dataset.py:
#if not self.lb_map is None:
#    label = self.lb_map[label]
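With the lookup commented out, raw pixel values from the label images reach the loss unchanged. As a rough sketch (not the repository's code), the lookup could be kept and built so that every value outside the two classes maps to the ignore label. The helper name `build_lb_map` and its `valid_ids` parameter are hypothetical and only serve the illustration; the sketch also assumes the raw class values really are 0 and 1.

```python
import numpy as np

LB_IGNORE = 255  # same ignore value as self.lb_ignore in the post


def build_lb_map(valid_ids=(0, 1), lb_ignore=LB_IGNORE):
    """Hypothetical helper: build a 256-entry lookup table.

    Raw values listed in valid_ids become train ids 0, 1, ...;
    every other raw value becomes lb_ignore.
    """
    lb_map = np.full(256, lb_ignore, dtype=np.uint8)
    for train_id, raw_id in enumerate(valid_ids):
        lb_map[raw_id] = train_id
    return lb_map


lb_map = build_lb_map()

# Toy label with stray values 7 and 128 that are not valid classes.
label = np.array([[0, 1, 7], [128, 1, 255]], dtype=np.uint8)
remapped = lb_map[label]  # same indexing as `label = self.lb_map[label]`
print(remapped)
```

If the masks store the foreground as some other value (for example 255 in a binary mask), `valid_ids` would need to list that value instead, and the ignore value would have to be chosen so it does not collide with it.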
train.py:
net = model_factory[cfg.model_type](2)
Top GitHub Comments
Yeah, I solved it. The cause was just as @ggohem said:
The label pixels did not match the classes: the labels contained other pixel values that are not among the listed categories.
We discussed it privately. Thank you!
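A quick way to check for this kind of mismatch before training is to scan the label arrays for values that are neither a valid class index nor the ignore value. The following is a minimal sketch, assuming `n_cats = 2` and `lb_ignore = 255` as in the post; the function name `find_invalid_label_values` is made up for illustration.

```python
import numpy as np


def find_invalid_label_values(label, n_cats=2, lb_ignore=255):
    """Return the sorted pixel values that are neither a class id
    in [0, n_cats) nor the ignore value."""
    values = np.unique(label)
    valid = (values < n_cats) | (values == lb_ignore)
    return values[~valid].tolist()


# Toy label containing one stray value (38) besides 0, 1 and 255.
label = np.array([[0, 1, 1], [0, 38, 255]], dtype=np.uint8)
bad = find_invalid_label_values(label)
print(bad)
```

In practice the same check would be run over every label file listed in train.txt and val.txt (loaded, for example, as a single-channel image with `cv2.imread(path, 0)`). Any non-empty result points to labels that can index past the 2 output channels in the loss, which on the GPU may surface as an illegal memory access rather than a clear error message.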