RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
Sorry to bother you, I got this problem when I run train.py:
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [151,0,0], thread: [122,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
[the same assertion repeats for threads [123,0,0] through [127,0,0] of block [151,0,0], threads [93,0,0] through [95,0,0] of block [207,0,0], and threads [62,0,0] through [63,0,0] of block [263,0,0]]
Traceback (most recent call last):
  File "train.py", line 133, in <module>
    loss = model(imgs, gts)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/heal/TorchSeg-master/model/bisenet/cityscapes.bisenet.R18/network.py", line 105, in forward
    aux_loss0 = self.ohem_criterion(self.heads0, label)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/heal/TorchSeg-master/furnace/seg_opr/loss_opr.py", line 85, in forward
    index = mask_prob.argsort()
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 248, in argsort
    return torch.argsort(self, dim, descending)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/functional.py", line 648, in argsort
    return torch.sort(input, -1, descending)[1]
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "train.py", line 167, in <module>
    config.log_dir_link)
  File "/home/heal/TorchSeg-master/furnace/engine/engine.py", line 154, in __exit__
    torch.cuda.empty_cache()
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 374, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
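Note: device-side asserts are raised asynchronously, so the traceback above points at whichever call next synchronized with the GPU (torch.sort inside the OHEM criterion, then torch.cuda.empty_cache), not at the indexing op that actually read the bad value. One way to pin the failure to the exact kernel, at the cost of slower execution, is to force synchronous launches via the standard CUDA_LAUNCH_BLOCKING environment variable, set before CUDA is initialized:

import os

# Either export CUDA_LAUNCH_BLOCKING=1 in the shell before running train.py,
# or set it at the very top of train.py, before torch touches the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"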
And this is my dataset:

class Camvid(BaseDataset):
    trans_labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                    17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
                    31, 32]

    @classmethod
    def get_class_colors(*args):
        return [[64, 128, 64], [192, 0, 128], [0, 128, 192], [0, 128, 64],
                [128, 0, 0], [64, 0, 128], [64, 0, 192], [192, 128, 64],
                [192, 192, 128], [64, 64, 128], [128, 0, 192], [192, 0, 64],
                [128, 128, 64], [192, 0, 192], [128, 64, 64], [64, 192, 128],
                [64, 64, 0], [128, 64, 128], [128, 128, 92], [0, 0, 192],
                [192, 128, 128], [128, 128, 128], [64, 128, 192], [0, 0, 64],
                [0, 64, 64], [192, 64, 128], [128, 128, 0], [192, 128, 192],
                [64, 0, 64], [192, 192, 0], [0, 0, 0], [64, 192, 0]]

    @classmethod
    def get_class_names(*args):
        return ['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car',
                'CartLuggagePram', 'Child', 'Column_Pole', 'Fence',
                'LaneMkgsDriv', 'LaneMkgsNonDriv', 'Misc_Text',
                'MotorcycleScooter', 'OtherMoving', 'ParkingBlock',
                'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk', 'SignSymbol',
                'Sky', 'SUVPickupTruck', 'TrafficCone', 'TrafficLight',
                'Train', 'Tree', 'Truck_Bus', 'Tunnel', 'VegetationMisc',
                'Void', 'Wall']
This is my config:

C = edict()
config = C
cfg = C

C.seed = 12345

"""please config ROOT_dir and user when u first using"""
C.repo_name = 'TorchSeg'
C.abs_dir = osp.realpath(".")
C.this_dir = C.abs_dir.split(osp.sep)[-1]
C.root_dir = C.abs_dir[:C.abs_dir.index(C.repo_name) + len(C.repo_name)]
C.log_dir = osp.abspath(osp.join(C.root_dir, 'log', C.this_dir))
C.log_dir_link = osp.join(C.abs_dir, 'log')
C.snapshot_dir = osp.abspath(osp.join(C.log_dir, "snapshot"))

exp_time = time.strftime('%Y_%m_%d_%H_%M_%S', time.localtime())
C.log_file = C.log_dir + '/log_' + exp_time + '.log'
C.link_log_file = C.log_file + '/log_last.log'
C.val_log_file = C.log_dir + '/val_' + exp_time + '.log'
C.link_val_log_file = C.log_dir + '/val_last.log'

"""Data Dir and Weight Dir"""
C.dataset_path = "/home/heal/TorchSeg-master/data/CamVid/"
C.img_root_folder = C.dataset_path
C.gt_root_folder = C.dataset_path
C.train_source = osp.join(C.dataset_path, "train.txt")
C.eval_source = osp.join(C.dataset_path, "val.txt")
C.test_source = osp.join(C.dataset_path, "test.txt")
C.is_test = False

"""Path Config"""

def add_path(path):
    if path not in sys.path:
        sys.path.insert(0, path)

add_path(osp.join(C.root_dir, 'furnace'))

# =============================================================================
# from torch.utils.pyt_utils import model_urls
# =============================================================================

"""Image Config"""
C.num_classes = 32
C.background = 0
C.image_mean = np.array([0.485, 0.456, 0.406])  # 0.485, 0.456, 0.406
C.image_std = np.array([0.229, 0.224, 0.225])
C.target_size = 512
C.image_height = 512
C.image_width = 512
C.num_train_imgs = 420
C.num_eval_imgs = 20

"""Settings for network, this would be different for each kind of model"""
C.fix_bias = True
C.fix_bn = False
C.sync_bn = True
C.bn_eps = 1e-5
C.bn_momentum = 0.1
C.pretrained_model = "/home/heal/TorchSeg-master/pytorch_model/resnet18_v1.pth"

"""Train Config"""
C.lr = 1e-2
C.lr_power = 0.9
C.momentum = 0.9
C.weight_decay = 5e-4
C.batch_size = 8  # 4 * C.num_gpu
C.nepochs = 150
C.niters_per_epoch = 420
C.num_workers = 4
C.train_scale_array = [0.75, 1, 1.25, 1.5, 1.75, 2.0]

"""Eval Config"""
C.eval_iter = 30
C.eval_stride_rate = 5 / 6
C.eval_scale_array = [1, ]  # 0.5, 0.75, 1, 1.25, 1.5, 1.75
C.eval_flip = False
C.eval_base_size = 512
C.eval_crop_size = 512

"""Display Config"""
C.snapshot_iter = 50
C.record_info_iter = 20
C.display_iter = 50
According to my experience, this is mainly because your label values are not in the range of 0 ~ config.num_classes - 1.
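If the CamVid masks encode classes as 1 through 32 (matching trans_labels above) while C.num_classes = 32 means the loss can only index 0 through 31, every pixel labeled 32 is out of bounds. A quick host-side check, sketched under the assumption that the ground-truth files are single-channel label PNGs (the helper name scan_gt is made up for illustration):

import numpy as np
from PIL import Image

def scan_gt(gt_path, num_classes=32, ignore_label=255):
    # Report any label values the criterion cannot index into its logits.
    gt = np.asarray(Image.open(gt_path))
    values = np.unique(gt)
    bad = [int(v) for v in values
           if v != ignore_label and not (0 <= v < num_classes)]
    if bad:
        print(gt_path, "has out-of-range labels:", bad)
    return bad

If the scan turns up values like 32 (or an unmapped void value such as 255), either remap the masks offline or subtract 1 in the dataset's label-loading step, making sure whatever marks void pixels ends up as the ignore index your criterion is configured with, if any.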