RuntimeError: CUDA error: an illegal memory access was encountered

See original GitHub issue

@CoinCheung, hello! I want to train on my own dataset with two classes, 0 and 1. I changed the code in bisenetv2.py, cityscapes_cv2.py, train.py, and base_dataset.py as shown further down. But when I run CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 tools/train.py --model bisenetv2, I get an error. Can you give me some advice? Thank you! The error:

Traceback (most recent call last):
  File "tools/train.py", line 232, in <module>
    main()
  File "tools/train.py", line 226, in main
    train()
  File "tools/train.py", line 175, in train
    loss_pre = criteria_pre(logits, lb)
  File "/home/liyanzhou/anaconda3/envs/py36_torch16/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./lib/ohem_ce_loss.py", line 43, in forward
    loss_hard = loss[loss > self.thresh]
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

The code I changed:

bisenetv2.py:

cfg = dict(
    model_type='bisenetv2',
    num_aux_heads=4,
    lr_start = 5e-2,
    weight_decay=5e-4,
    warmup_iters = 1000,
    max_iter = 150000,                              # iter  150000
    im_root='./datasets/dianchipian',                          # dianchipian     # cityscapes
    train_im_anns='./datasets/dianchipian/train.txt',
    val_im_anns='./datasets/dianchipian/val.txt',
    scales=[0.25, 2.],
    cropsize=[512, 1024],
    ims_per_gpu=2,             # batch_size 8
    use_fp16=True,
    use_sync_bn=False,
    respth='./res',
)

cityscapes_cv2.py:

class CityScapes(BaseDataset):
    '''
    '''
    def __init__(self, dataroot, annpath, trans_func=None, mode='train'):
        super(CityScapes, self).__init__(
                dataroot, annpath, trans_func, mode)
        self.n_cats = 2  # 19
        self.lb_ignore = 255
        self.lb_map = np.arange(256).astype(np.uint8)
        #for el in labels_info:
        #    self.lb_map[el['id']] = el['trainId']

        self.to_tensor = T.ToTensor(
            mean=(0.3257, 0.3690, 0.3223), # city, rgb
            std=(0.2112, 0.2148, 0.2115),
        )

base_dataset.py:

    #if not self.lb_map is None:
    #    label = self.lb_map[label]

train.py:

    net = model_factory[cfg.model_type](2)
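In the changes above, lb_map is an identity table and the lookup in base_dataset.py is commented out, so whatever pixel values the label files contain reach the loss unchanged. As a minimal sketch of one alternative (not the repository's code), a lookup table can keep the two class ids and send every other value to the ignore id. The helper name build_label_map and the sample array are hypothetical, and whether stray values should be ignored rather than fixed in the label files depends on the dataset.

```python
import numpy as np

def build_label_map(valid_ids=(0, 1), ignore_id=255):
    """Hypothetical helper: 256-entry lookup table for uint8 labels.

    Valid class ids map to themselves; every other value maps to ignore_id.
    """
    lb_map = np.full(256, ignore_id, dtype=np.uint8)
    for class_id in valid_ids:
        lb_map[class_id] = class_id
    return lb_map

lb_map = build_label_map()

# Made-up label patch containing stray values (7 and 128).
label = np.array([[0, 1, 7], [1, 128, 255]], dtype=np.uint8)

# Same indexing pattern as the commented-out line: label = self.lb_map[label]
remapped = lb_map[label]
print(remapped)
```

After the lookup, the array holds only 0, 1, and 255, which matches n_cats = 2 with lb_ignore = 255 as set in the question.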

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 12 (2 by maintainers)

Top GitHub Comments

2 reactions
yanzhou-li commented, Aug 13, 2021

"Have you solved the problem now?"

Yeah, I solved it. The cause was just what @ggohem said: the label pixels do not correspond to the classes; the labels contain other pixel values that are not among the listed categories. We discussed it privately. Thank you!

"Hi @yanzhou-li, I get the same error. Can you share the advice publicly?"

Try checking your labels and make sure they contain only two pixel values. The pixel values may change after the label is read with cv2.imread(lbpth, 0) in base_dataset.py. To prevent this kind of mistake, binarize the label after reading it in base_dataset.py, as follows: _, label = cv2.threshold(label, 0, 255, cv2.THRESH_BINARY)
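The check described in this comment can be sketched with NumPy alone. This is an illustration under stated assumptions, not the thread's exact fix: the helper names and the sample array are hypothetical, and the binarization uses a plain comparison in place of cv2.threshold. It maps foreground to 1 rather than 255 because the question sets lb_ignore = 255, so a foreground value of 255 would presumably be treated as ignored.

```python
import numpy as np

def unexpected_label_values(label, valid_ids=(0, 1), ignore_id=255):
    """Hypothetical helper: list the pixel values that are neither a
    valid class id nor the ignore id."""
    allowed = set(valid_ids) | {ignore_id}
    return [int(v) for v in np.unique(label) if int(v) not in allowed]

def binarize_label(label):
    """Hypothetical helper: map every non-zero pixel to class 1.

    Assumption: any non-zero value is foreground. Note that this also turns
    pixels equal to 255 into class 1, so it only suits labels that do not
    use 255 as an ignore marker.
    """
    return (label > 0).astype(np.uint8)

# Made-up label patch with stray values (3 and 254), for example from
# anti-aliased edges or lossy image compression.
label = np.array([[0, 3, 254], [0, 255, 1]], dtype=np.uint8)

bad_values = unexpected_label_values(label)
fixed = binarize_label(label)

print("unexpected values:", bad_values)
print("after binarizing:", np.unique(fixed).tolist())
```

Running the check over every label file before training should show whether any file contains values outside the expected set.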

