Distributed validation
See original GitHub issue

❓ Questions and Help
I’m working on a branch where I implemented validation inference at every checkpoint. Everything was working fine until the migration from the deprecated `torch.distributed.deprecated` to the new `torch.distributed`. Now either the `DataLoader` breaks on one of the processes or, if I run the inference only in the main process, it hangs there forever.
Sample code:
    if iteration % checkpoint_period == 0 or iteration == max_iter:
        checkpointer.save("model_{:07d}".format(iteration), **arguments)
        if is_main_process():
            if val_data_loader is not None:
                logger.info("Evaluating on validation data set")
                iou_types = ("bbox",)
                if cfg.MODEL.MASK_ON:
                    iou_types = iou_types + ("segm",)
                inference(
                    model,
                    val_data_loader,
                    iou_types=iou_types,
                    box_only=cfg.MODEL.RPN_ONLY,
                    device=cfg.MODEL.DEVICE,
                    expected_results=cfg.TEST.EXPECTED_RESULTS,
                    expected_results_sigma_tol=cfg.TEST.EXPECTED_RESULTS_SIGMA_TOL,
                    verbose=False,
                )
                model.train()  # reset training flag
        synchronize()
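A common cause of this kind of hang is that the validation routine performs collective communication internally (e.g. gathering predictions across ranks), so calling it only on the main process leaves every other rank blocked in the collective forever. A minimal sketch of the safer pattern, where every rank runs validation and only rank 0 reports, is shown below. This is not the maskrcnn-benchmark `inference` function; the model, loader, and MSE loss here are hypothetical stand-ins for illustration:

```python
import torch
import torch.distributed as dist


def is_distributed():
    return dist.is_available() and dist.is_initialized()


def get_rank():
    return dist.get_rank() if is_distributed() else 0


@torch.no_grad()
def validate(model, data_loader, device="cpu"):
    """Run validation on EVERY rank; reduce the loss to all ranks.

    All ranks must call this together: all_reduce below is a
    collective, and invoking it on only one rank would hang the job.
    """
    model.eval()
    total_loss = torch.zeros(1, device=device)
    count = torch.zeros(1, device=device)
    for inputs, targets in data_loader:
        outputs = model(inputs.to(device))
        loss = torch.nn.functional.mse_loss(outputs, targets.to(device))
        total_loss += loss
        count += 1
    if is_distributed():
        dist.all_reduce(total_loss)  # every rank participates
        dist.all_reduce(count)
    model.train()  # restore training mode
    # Only the main process reports; the collectives above already ran
    # on all ranks, so nobody is left waiting.
    if get_rank() == 0:
        return (total_loss / count).item()
    return None
```

Applied to the snippet above, this would mean moving the `inference(...)` call outside the `is_main_process()` guard so all ranks enter it, and keeping only the logging/saving of results inside the guard.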
Is there any workaround for this?
Issue Analytics
- State:
- Created: 5 years ago
- Comments: 14 (7 by maintainers)
Top GitHub Comments
@fmassa Hi. I additionally tracked the training at every iteration and found that it can actually train for the first 7 to 9 iterations, then breaks with the following error. It seems that some index is out of bounds in CUDA, but I cannot figure out which one. Here is part of the error output; since much of it follows a repeated pattern, I provide only a small fraction.

Update: I also found that it may occur occasionally even with single-GPU training.
@fmassa
Thanks for your reply. I modified my code and it has not hung since. I actually ran several rounds successfully after that, but then suddenly got stuck on a new problem with distributed training. Here is the log; I updated my PyTorch to the latest version, but that did not help.