
distributed error encountered


❓ Questions and Help

I tried to use just P2-P4 of the FPN and modified only a few lines of code. The code works well on a single GPU, but when using more than one GPU, the error below is encountered.

Traceback (most recent call last):
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 251, in <module>
    main()
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 244, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 153, in train
    arguments,
  File "/root/txy1/mask-rcnn/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 81, in do_train
    losses.backward()
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 384, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 413, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]
Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7ffb97f8c180>, [[tensor([[[[0.]],

The main modification that I made is in the forward function of fpn.py:

        # use only P2-P4 rather than P2-P5
        # use_P5 is a bool: the FPN outputs P2-P5 when use_P5 is True, and P2-P4 when it is False
        if not self.use_P5:
            results.pop()
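
For context (and in line with the suggestion in the comments below), popping the last result only discards the P5 output: the 3x3 conv that produces P5 still holds parameters, and because its output never reaches the loss, those parameters receive no gradients, which is what the distributed gradient reduction then trips over. Below is a minimal sketch of avoiding this by deleting the unused conv in __init__ instead; the layer_blocks attribute and the add_module naming are assumptions based on maskrcnn-benchmark's fpn.py, and the forward pass would have to be adapted to match, so treat it as a sketch rather than a drop-in fix.

    # Sketch only, assuming maskrcnn-benchmark's FPN keeps the names of its 3x3
    # output convs in self.layer_blocks and registers the modules via add_module.
    # Deleting the top-level conv means no parameter is left in the model that
    # never receives a gradient under DistributedDataParallel.
    if not self.use_P5:
        unused_name = self.layer_blocks.pop()  # name of the conv that would produce P5 (assumed)
        delattr(self, unused_name)             # remove the module and its parameters
        # the forward pass must then also skip appending a P5 result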

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 34 (14 by maintainers)

Top GitHub Comments

20 reactions
chengyangfu commented, Feb 18, 2019

I also met the same problem, and the solution is to remove the unused parameters.

Try adding the following lines to the code.

for name, param in model.named_parameters():
    # after losses.backward(): report which parameters actually received a gradient
    print(name, param.shape, param.grad is not None)

After backward, if a parameter does not contain a grad, it means the parameter is either frozen or not used in the forward pass.
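
If deleting the unused layers is awkward, newer PyTorch releases (1.1 and later) expose a find_unused_parameters flag on torch.nn.parallel.DistributedDataParallel that tells the reducer to skip parameters which received no gradient instead of erroring out. Whether that flag is available depends on the PyTorch version in use, and the helper name below is hypothetical, so take this as a sketch of the alternative rather than a confirmed fix for this repository.

import torch.nn as nn

def wrap_model_for_ddp(model, local_rank):
    # Sketch: assumes PyTorch >= 1.1, where DistributedDataParallel accepts
    # find_unused_parameters. With the flag set, parameters that get no gradient
    # in an iteration are skipped during reduction rather than raising an error,
    # at the cost of a small amount of extra bookkeeping per iteration.
    return nn.parallel.DistributedDataParallel(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
        find_unused_parameters=True,
    )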

5 reactions
Lausannen commented, Feb 4, 2019

Hi, I have met the same problem with distributed training; the same error that @txytju and @HOPEver1991 referred to is raised. I tried to use one node with multiple GPUs, but the backward pass failed. I will try to provide a minimal reproducible example, but since I have changed a lot in this repository, it will take some time. I would appreciate any suggestions. Thanks!

