
Distributed validation

See original GitHub issue

❓ Questions and Help

I’m working on a branch where I implemented validation inference at every checkpoint.

Everything was working fine until the migration from torch.deprecated.distributed to the new torch.distributed.

Now either the DataLoader breaks on one of the processes or, if I run the inference only in the main process, it hangs there forever.

Sample code:

        if iteration % checkpoint_period == 0 or iteration == max_iter:
            checkpointer.save("model_{:07d}".format(iteration), **arguments)
            if is_main_process():
                if val_data_loader is not None:
                    logger.info('Evaluating on validation data set')
                    iou_types = ("bbox",)
                    if cfg.MODEL.MASK_ON:
                        iou_types = iou_types + ("segm",)

                    inference(
                        model,
                        val_data_loader,
                        iou_types=iou_types,
                        box_only=cfg.MODEL.RPN_ONLY,
                        device=cfg.MODEL.DEVICE,
                        expected_results=cfg.TEST.EXPECTED_RESULTS,
                        expected_results_sigma_tol=cfg.TEST.EXPECTED_RESULTS_SIGMA_TOL,
                        verbose=False
                    )

                model.train()  # reset training flag
            synchronize()

Is there any workaround for this?
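
A common pattern with the newer torch.distributed is to have every rank run the validation pass, so that the distributed sampler and any collective calls inside it see all processes, and to hit the barrier on all ranks. The sketch below is illustrative only; it reuses the same helpers (inference, synchronize, is_main_process) as the snippet above as an assumption, and is not a maintainer-confirmed fix.

    # Sketch only: every rank participates in validation so collective ops
    # (distributed sampler, result gathering inside inference) do not wait
    # forever on the non-main processes. Helper names mirror the snippet
    # above and are assumptions, not a verified API.
    if iteration % checkpoint_period == 0 or iteration == max_iter:
        checkpointer.save("model_{:07d}".format(iteration), **arguments)
        if val_data_loader is not None:
            if is_main_process():
                logger.info('Evaluating on validation data set')
            iou_types = ("bbox",) + (("segm",) if cfg.MODEL.MASK_ON else ())
            inference(                      # called on every rank
                model,
                val_data_loader,
                iou_types=iou_types,
                box_only=cfg.MODEL.RPN_ONLY,
                device=cfg.MODEL.DEVICE,
                expected_results=cfg.TEST.EXPECTED_RESULTS,
                expected_results_sigma_tol=cfg.TEST.EXPECTED_RESULTS_SIGMA_TOL,
                verbose=False,
            )
            synchronize()                   # barrier reached by all ranks
        model.train()                       # restore training mode everywhere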

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

2 reactions
BobZhangHT commented, Jan 13, 2019

@fmassa Hi. I additionally tracked the training at every iteration and found that it actually trains for the first 7 to 9 iterations, then breaks with the following error. It seems that some index is out of bounds in CUDA, but I cannot figure out which one. Here is part of the error output; since the lines follow a repeated pattern, I only include a small fraction.

/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [76,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [77,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [78,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [79,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [52,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [53,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [54,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [55,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [56,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [57,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [58,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [59,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [60,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [61,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [62,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [29,0,0], thread: [63,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch-nightly_1547199076991/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [8,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
....
Traceback (most recent call last):
  File "tools/train_net.py", line 237, in <module>
    main()
  File "tools/train_net.py", line 230, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 138, in train
    arguments,)
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 109, in do_train
    loss_dict = model(images, targets)
  File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 364, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 100, in forward
    return self._forward_train(anchors, objectness, rpn_box_regression, targets)
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 116, in _forward_train
    anchors, objectness, rpn_box_regression, targets
  File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 138, in forward
    sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 113, in forward_for_single_feature_map
    boxlist = remove_small_boxes(boxlist, self.min_size)
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/structures/boxlist_ops.py", line 46, in remove_small_boxes
    (ws >= min_size) & (hs >= min_size)
RuntimeError: copy_if failed to synchronize: device-side assert triggered
terminate called without an active exception
terminate called without an active exception

Update: I also found that it may occasionally happen even with single-GPU training.
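
Because device-side asserts are raised asynchronously, the Python traceback (here pointing at remove_small_boxes) is often not the op that produced the bad index. Two things that can help localize it, sketched below under the assumption that the targets are BoxList-like objects with an (N, 4) xyxy bbox tensor and a "labels" field (placeholder names, not a verified maskrcnn-benchmark API), are forcing synchronous kernel launches and sanity-checking the ground truth before training.

    # Sketch: surface CUDA errors at the failing op and sanity-check targets.
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set before CUDA is initialized

    def check_targets(data_loader, num_classes):
        """Assumes BoxList-style targets: .bbox is an (N, 4) xyxy tensor and
        the "labels" field holds 1-based class ids (placeholder assumption)."""
        for images, targets, _ in data_loader:
            for t in targets:
                boxes = t.bbox
                labels = t.get_field("labels")
                assert boxes.numel() > 0, "image with no ground-truth boxes"
                assert (labels >= 1).all() and (labels < num_classes).all(), \
                    "label outside [1, num_classes)"
                ws = boxes[:, 2] - boxes[:, 0]
                hs = boxes[:, 3] - boxes[:, 1]
                assert (ws > 0).all() and (hs > 0).all(), "degenerate box"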

2 reactions
BobZhangHT commented, Jan 13, 2019

@fmassa

Thanks for your reply. I modified my code and it has not hung since then. I successfully ran several rounds after that, but then suddenly got stuck on a new problem with distributed training. Here is the log; I updated my PyTorch to the latest version, but it does not help.

File "tools/train_net.py", line 230, in <module>
    main()
  File "tools/train_net.py", line 223, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 121, in train
    masksgt,)# add 2019/01/11 
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 106, in do_train
    loss_dict = model(images, targets)
  File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 364, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 100, in forward
    return self._forward_train(anchors, objectness, rpn_box_regression, targets)
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 116, in _forward_train
    anchors, objectness, rpn_box_regression, targets
  File "/home/r7user3/anaconda2/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 138, in forward
    sampled_boxes.append(self.forward_for_single_feature_map(a, o, b))
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/inference.py", line 113, in forward_for_single_feature_map
    boxlist = remove_small_boxes(boxlist, self.min_size)
  File "/home/r7user3/ZhangHT/github/maskrcnn-benchmark/maskrcnn_benchmark/structures/boxlist_ops.py", line 46, in remove_small_boxes
    (ws >= min_size) & (hs >= min_size)
RuntimeError: copy_if failed to synchronize: device-side assert triggered
terminate called without an active exception
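
One conservative sanity step for this kind of assert in remove_small_boxes is to drop images whose annotations are empty or collapse to zero width or height before they ever reach the RPN. The helper below is a hypothetical sketch over a COCO-style annotation dict (images/annotations keys as in the COCO JSON format); it is not a maskrcnn-benchmark function and not a confirmed root cause here.

    # Hypothetical pre-filter over a COCO-style annotation dict: drop images
    # whose boxes are missing or degenerate (near-zero w/h in [x, y, w, h]).
    def filter_degenerate_images(coco_dict, min_size=1.0):
        anns_per_image = {}
        for ann in coco_dict["annotations"]:
            anns_per_image.setdefault(ann["image_id"], []).append(ann)
        bad_ids = set()
        for img in coco_dict["images"]:
            anns = anns_per_image.get(img["id"], [])
            keep = [a for a in anns
                    if a["bbox"][2] >= min_size and a["bbox"][3] >= min_size]
            if not keep:
                bad_ids.add(img["id"])
        coco_dict["images"] = [im for im in coco_dict["images"]
                               if im["id"] not in bad_ids]
        coco_dict["annotations"] = [a for a in coco_dict["annotations"]
                                    if a["image_id"] not in bad_ids]
        return coco_dict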

