
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered


Sorry to bother you, I got this problem when I run train.py:

/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [151,0,0], thread: [122,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [151,0,0], thread: [123,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [151,0,0], thread: [124,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [151,0,0], thread: [125,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [151,0,0], thread: [126,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [151,0,0], thread: [127,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [207,0,0], thread: [93,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [207,0,0], thread: [94,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [207,0,0], thread: [95,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [263,0,0], thread: [62,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1549636813070/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [263,0,0], thread: [63,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

Traceback (most recent call last):
  File "train.py", line 133, in <module>
    loss = model(imgs, gts)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/heal/TorchSeg-master/model/bisenet/cityscapes.bisenet.R18/network.py", line 105, in forward
    aux_loss0 = self.ohem_criterion(self.heads0, label)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/heal/TorchSeg-master/furnace/seg_opr/loss_opr.py", line 85, in forward
    index = mask_prob.argsort()
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 248, in argsort
    return torch.argsort(self, dim, descending)
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/functional.py", line 648, in argsort
    return torch.sort(input, -1, descending)[1]
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 167, in <module>
    config.log_dir_link)
  File "/home/heal/TorchSeg-master/furnace/engine/engine.py", line 154, in __exit__
    torch.cuda.empty_cache()
  File "/home/heal/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 374, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered

And this is my dataset:

class Camvid(BaseDataset):
    trans_labels = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,
                    28,29,30,31,32]

    @classmethod
    def get_class_colors(*args):
        return [[64,128,64],[192,0,128],[0,128,192],[0,128,64],[128,0,0],[64,0,128],[64,0,192],[192,128,64],[192,192,128],
                [64,64,128],[128,0,192],[192,0,64],[128,128,64],[192,0,192],[128,64,64],[64,192,128],[64,64,0],[128,64,128],
                [128,128,92],[0,0,192],[192,128,128],[128,128,128],[64,128,192],[0,0,64],[0,64,64],[192,64,128],[128,128,0],
                [192,128,192],[64,0,64],[192,192,0],[0,0,0],[64,192,0]]

    @classmethod
    def get_class_names(*args):
        return ['Animal', 'Archway', 'Bicyclist', 'Bridge', 'Building', 'Car', 'CartLuggagePram', 'Child',
                'Column_Pole', 'Fence', 'LaneMkgsDriv', 'LaneMkgsNonDriv', 'Misc_Text', 'MotorcycleScooter',
                'OtherMoving', 'ParkingBlock', 'Pedestrian', 'Road', 'RoadShoulder', 'Sidewalk', 'SignSymbol',
                'Sky', 'SUVPickupTruck', 'TrafficCone', 'TrafficLight', 'Train', 'Tree', 'Truck_Bus', 'Tunnel',
                'VegetationMisc', 'Void', 'Wall', ]

This is my config:

C = edict()
config = C
cfg = C

C.seed = 12345

"""please config ROOT_dir and user when u first using"""
C.repo_name = 'TorchSeg'
C.abs_dir = osp.realpath(".")
C.this_dir = C.abs_dir.split(osp.sep)[-1]
C.root_dir = C.abs_dir[:C.abs_dir.index(C.repo_name) + len(C.repo_name)]
C.log_dir = osp.abspath(osp.join(C.root_dir, 'log', C.this_dir))
C.log_dir_link = osp.join(C.abs_dir, 'log')
C.snapshot_dir = osp.abspath(osp.join(C.log_dir, "snapshot"))

exp_time = time.strftime('%Y_%m_%d_%H_%M_%S', time.localtime())
C.log_file = C.log_dir + '/log_' + exp_time + '.log'
C.link_log_file = C.log_file + '/log_last.log'
C.val_log_file = C.log_dir + '/val_' + exp_time + '.log'
C.link_val_log_file = C.log_dir + '/val_last.log'

"""Data Dir and Weight Dir"""
C.dataset_path = "/home/heal/TorchSeg-master/data/CamVid/"
C.img_root_folder = C.dataset_path
C.gt_root_folder = C.dataset_path
C.train_source = osp.join(C.dataset_path, "train.txt")
C.eval_source = osp.join(C.dataset_path, "val.txt")
C.test_source = osp.join(C.dataset_path, "test.txt")
C.is_test = False

"""Path Config"""

def add_path(path):
    if path not in sys.path:
        sys.path.insert(0, path)

add_path(osp.join(C.root_dir, 'furnace'))

=============================================================================

from torch.utils.pyt_utils import model_urls

=============================================================================

"""Image Config"""
C.num_classes = 32
C.background = 0
C.image_mean = np.array([0.485, 0.456, 0.406])  # 0.485, 0.456, 0.406
C.image_std = np.array([0.229, 0.224, 0.225])
C.target_size = 512
C.image_height = 512
C.image_width = 512
C.num_train_imgs = 420
C.num_eval_imgs = 20

"""Settings for network, this would be different for each kind of model"""
C.fix_bias = True
C.fix_bn = False
C.sync_bn = True
C.bn_eps = 1e-5
C.bn_momentum = 0.1
C.pretrained_model = "/home/heal/TorchSeg-master/pytorch_model/resnet18_v1.pth"

"""Train Config"""
C.lr = 1e-2
C.lr_power = 0.9
C.momentum = 0.9
C.weight_decay = 5e-4
C.batch_size = 8  # 4 * C.num_gpu
C.nepochs = 150
C.niters_per_epoch = 420
C.num_workers = 4
C.train_scale_array = [0.75, 1, 1.25, 1.5, 1.75, 2.0]

"""Eval Config"""
C.eval_iter = 30
C.eval_stride_rate = 5 / 6
C.eval_scale_array = [1, ]  # 0.5, 0.75, 1, 1.25, 1.5, 1.75
C.eval_flip = False
C.eval_base_size = 512
C.eval_crop_size = 512

"""Display Config"""
C.snapshot_iter = 50
C.record_info_iter = 20
C.display_iter = 50

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

5 reactions
ycszen commented, Aug 1, 2019

In my experience, this is mainly because your label values are not in the range 0 to config.num_classes - 1.
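The range check described in this comment can be sketched with NumPy as follows. This is only an illustration: the helper name `find_invalid_labels` is hypothetical, and treating 255 as an "ignore" value is an assumption, so adjust it to whatever your loss is configured to skip.

```python
import numpy as np

def find_invalid_labels(label_map, num_classes, ignore_label=255):
    """Return the label values that fall outside 0..num_classes-1.

    ignore_label (assumed to be 255 here) is not reported as invalid.
    """
    values = np.unique(label_map)
    return [int(v) for v in values
            if v != ignore_label and not (0 <= v < num_classes)]

# Toy ground-truth map: with 32 classes the valid ids are 0..31,
# so the value 32 is out of range.
gt = np.array([[0, 5, 31], [32, 255, 7]], dtype=np.int64)
print(find_invalid_labels(gt, num_classes=32))
```

Running a check like this over every ground-truth image before training should point to the files whose labels would trip the index-out-of-bounds assertion on the GPU.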

2 reactions
memeda2232 commented, Sep 24, 2019

"@memeda2232 hello, I also met the same problem, did you find any solution to solve it?"
Yes, changing the labels as the author says will solve this problem.
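One possible way to change the labels is a lookup table that maps the ids stored in the annotation files to training ids. The sketch below assumes the files store ids 1 to 32 (like the trans_labels list in the question) and that 255 is the ignore value; both are assumptions, and `build_lookup` is a hypothetical helper, not part of TorchSeg.

```python
import numpy as np

def build_lookup(raw_ids, ignore_label=255):
    """Build a 256-entry table mapping raw ids to train ids 0..N-1.

    Any raw value not listed in raw_ids maps to ignore_label (assumed 255).
    """
    lut = np.full(256, ignore_label, dtype=np.uint8)
    for train_id, raw_id in enumerate(raw_ids):
        lut[raw_id] = train_id
    return lut

raw_ids = list(range(1, 33))   # assumed raw ids 1..32
lut = build_lookup(raw_ids)

# Toy ground-truth map in raw ids; 0 and 255 are not listed, so they are ignored.
gt = np.array([[1, 6, 32], [0, 255, 8]], dtype=np.uint8)
remapped = lut[gt]             # integer-array indexing applies the table per pixel
print(remapped)
```

After remapping, every value other than the ignore value lies in 0..31, which matches num_classes = 32 in the config above.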


Top Results From Across the Web

RuntimeError: reduce failed to synchronize: device-side assert ...
In my case, the issue was caused because binary cross entropy expected the input values to be between 0~1, but I was sending...
reduce failed to synchronize: device-side assert triggered problem ...
Solving RuntimeError: reduce failed to synchronize: device-side assert ... pytorch runtime error(59): device-side assert triggered at XXX.
CUDA Error: Device-Side Assert Triggered: Solved | Built In
The code above will trigger a CUDA runtime error 59 if you are using a GPU. You can fix it by passing your...
Release 0.57.0.dev0+927.g61e4b01a0.dirty-py3.8-linux
Should the compilation in nopython mode fail, Numba can compile using object ... This example demonstrates that calling f() with mixed types caused...
Changelog | Thrust
While some Thrust algorithms require internal synchronization to safely compute their ... NVIDIA/thrust#1329: Fix runtime error when copying an empty ...
