RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
Sorry to bother you, I got this problem when I run train.py:
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [151,0,0], thread: [122,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
[the same assertion repeats for threads [123,0,0] through [127,0,0] of block [151,0,0], threads [93,0,0] through [95,0,0] of block [207,0,0], and threads [62,0,0] through [63,0,0] of block [263,0,0]]
Traceback (most recent call last):
  File "train.py", line 133, in <module>
    loss = model(imgs, gts)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/heal/TorchSeg-master/model/bisenet/cityscapes.bisenet.R18/network.py", line 105, in forward
    aux_loss0 = self.ohem_criterion(self.heads0, label)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/heal/TorchSeg-master/furnace/seg_opr/loss_opr.py", line 85, in forward
    index = mask_prob.argsort()
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 248, in argsort
    return torch.argsort(self, dim, descending)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/functional.py", line 648, in argsort
    return torch.sort(input, -1, descending)[1]
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "train.py", line 167, in <module>
    config.log_dir_link)
  File "/home/heal/TorchSeg-master/furnace/engine/engine.py", line 154, in __exit__
    torch.cuda.empty_cache()
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 374, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
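Note: device-side asserts are raised asynchronously, so the traceback above points at whichever call next synchronized with the GPU (torch.sort inside the OHEM criterion, then torch.cuda.empty_cache), not at the indexing op that actually read the bad value. One way to pin the failure to the exact kernel, at the cost of slower execution, is to force synchronous launches via the standard CUDA_LAUNCH_BLOCKING environment variable, set before CUDA is initialized:

import os

# Either export CUDA_LAUNCH_BLOCKING=1 in the shell before running train.py,
# or set it at the very top of train.py, before torch touches the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"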
And this is my dataset:

class Camvid(BaseDataset):
    trans_labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                    17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
                    31, 32]

    @classmethod
    def get_class_colors(*args):
        return [[64, 128, 64], [192, 0, 128], [0, 128, 192], [0, 128, 64],
                [128, 0, 0], [64, 0, 128], [64, 0, 192], [192, 128, 64],
                [192, 192, 128], [64, 64, 128], [128, 0, 192], [192, 0, 64],
                [128, 128, 64], [192, 0, 192], [128, 64, 64], [64, 192, 128],
                [64, 64, 0], [128, 64, 128], [128, 128, 92], [0, 0, 192],
                [192, 128, 128], [128, 128, 128], [64, 128, 192], [0, 0, 64],
                [0, 64, 64], [192, 64, 128], [128, 128, 0], [192, 128, 192],
                [64, 0, 64], [192, 192, 0], [0, 0, 0], [64, 192, 0]]

    @classmethod
    def get_class_names(*args):
        return ['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car',
                'CartLuggagePram', 'Child', 'Column_Pole', 'Fence',
                'LaneMkgsDriv', 'LaneMkgsNonDriv', 'Misc_Text',
                'MotorcycleScooter', 'OtherMoving', 'ParkingBlock',
                'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk', 'SignSymbol',
                'Sky', 'SUVPickupTruck', 'TrafficCone', 'TrafficLight',
                'Train', 'Tree', 'Truck_Bus', 'Tunnel', 'VegetationMisc',
                'Void', 'Wall']
This is my config:

C = edict()
config = C
cfg = C

C.seed = 12345

"""please config ROOT_dir and user when u first using"""
C.repo_name = 'TorchSeg'
C.abs_dir = osp.realpath(".")
C.this_dir = C.abs_dir.split(osp.sep)[-1]
C.root_dir = C.abs_dir[:C.abs_dir.index(C.repo_name) + len(C.repo_name)]
C.log_dir = osp.abspath(osp.join(C.root_dir, 'log', C.this_dir))
C.log_dir_link = osp.join(C.abs_dir, 'log')
C.snapshot_dir = osp.abspath(osp.join(C.log_dir, "snapshot"))

exp_time = time.strftime('%Y_%m_%d_%H_%M_%S', time.localtime())
C.log_file = C.log_dir + '/log_' + exp_time + '.log'
C.link_log_file = C.log_file + '/log_last.log'
C.val_log_file = C.log_dir + '/val_' + exp_time + '.log'
C.link_val_log_file = C.log_dir + '/val_last.log'

"""Data Dir and Weight Dir"""
C.dataset_path = "/home/heal/TorchSeg-master/data/CamVid/"
C.img_root_folder = C.dataset_path
C.gt_root_folder = C.dataset_path
C.train_source = osp.join(C.dataset_path, "train.txt")
C.eval_source = osp.join(C.dataset_path, "val.txt")
C.test_source = osp.join(C.dataset_path, "test.txt")
C.is_test = False

"""Path Config"""

def add_path(path):
    if path not in sys.path:
        sys.path.insert(0, path)

add_path(osp.join(C.root_dir, 'furnace'))

# =============================================================================
# from torch.utils.pyt_utils import model_urls
# =============================================================================

"""Image Config"""
C.num_classes = 32
C.background = 0
C.image_mean = np.array([0.485, 0.456, 0.406])  # 0.485, 0.456, 0.406
C.image_std = np.array([0.229, 0.224, 0.225])
C.target_size = 512
C.image_height = 512
C.image_width = 512
C.num_train_imgs = 420
C.num_eval_imgs = 20

"""Settings for network, this would be different for each kind of model"""
C.fix_bias = True
C.fix_bn = False
C.sync_bn = True
C.bn_eps = 1e-5
C.bn_momentum = 0.1
C.pretrained_model = "/home/heal/TorchSeg-master/pytorch_model/resnet18_v1.pth"

"""Train Config"""
C.lr = 1e-2
C.lr_power = 0.9
C.momentum = 0.9
C.weight_decay = 5e-4
C.batch_size = 8  # 4 * C.num_gpu
C.nepochs = 150
C.niters_per_epoch = 420
C.num_workers = 4
C.train_scale_array = [0.75, 1, 1.25, 1.5, 1.75, 2.0]

"""Eval Config"""
C.eval_iter = 30
C.eval_stride_rate = 5 / 6
C.eval_scale_array = [1, ]  # 0.5, 0.75, 1, 1.25, 1.5, 1.75
C.eval_flip = False
C.eval_base_size = 512
C.eval_crop_size = 512

"""Display Config"""
C.snapshot_iter = 50
C.record_info_iter = 20
C.display_iter = 50
According to my experience, this is mainly because your label values are not in the range of 0 ~ config.num_classes - 1.
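If the CamVid masks encode classes as 1 through 32 (matching trans_labels above) while C.num_classes = 32 means the loss can only index 0 through 31, every pixel labeled 32 is out of bounds. A quick host-side check, sketched under the assumption that the ground-truth files are single-channel label PNGs (the helper name scan_gt is made up for illustration):

import numpy as np
from PIL import Image

def scan_gt(gt_path, num_classes=32, ignore_label=255):
    # Report any label values the criterion cannot index into its logits.
    gt = np.asarray(Image.open(gt_path))
    values = np.unique(gt)
    bad = [int(v) for v in values
           if v != ignore_label and not (0 <= v < num_classes)]
    if bad:
        print(gt_path, "has out-of-range labels:", bad)
    return bad

If the scan turns up values like 32 (or an unmapped void value such as 255), either remap the masks offline or subtract 1 in the dataset's label-loading step, making sure whatever marks void pixels ends up as the ignore index your criterion is configured with, if any.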