distributed error encountered
❓ Questions and Help
I tried to use just P2-P4 of the FPN and modified only a few lines of code. The code works well on a single GPU, but when using more than one GPU, the error below is encountered.
Traceback (most recent call last):
File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 251, in <module>
main()
File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 244, in main
model = train(cfg, args.local_rank, args.distributed)
File "/root/txy1/mask-rcnn/maskrcnn-benchmark/tools/train_net.py", line 153, in train
arguments,
File "/root/txy1/mask-rcnn/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 81, in do_train
losses.backward()
File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 384, in distributed_data_parallel_hook
self._queue_reduction(bucket_idx)
File "/opt/conda/envs/maskrcnn_benchmark/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 413, in _queue_reduction
self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]
Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7ffb97f8c180>, [[tensor([[[[0.]],
The main modification that I made is in the forward function of fpn.py:
# use only P2-P4 rather than P2-P5
# use_P5 is a bool: the FPN outputs P2-P5 when use_P5 is True and P2-P4 when it is False
if not self.use_P5:
    results.pop()
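For context, here is a minimal, self-contained sketch (an illustrative toy module, not the actual maskrcnn-benchmark FPN; class and attribute names are assumptions) of why popping the last result can trip up distributed training: the conv that produces P5 still runs in the forward pass, but its output never contributes to the loss, so its parameters end up with no gradient after backward.

import torch
import torch.nn as nn

class TinyFPN(nn.Module):
    """Simplified stand-in for an FPN head; names and structure are illustrative."""

    def __init__(self, channels=256, num_levels=4, use_P5=True):
        super().__init__()
        self.use_P5 = use_P5
        # One output conv per pyramid level (P2..P5).
        self.layer_blocks = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_levels)]
        )

    def forward(self, features):
        # features: per-level inputs, finest first (stand-ins for C2..C5).
        results = [block(f) for block, f in zip(self.layer_blocks, features)]
        if not self.use_P5:
            # Dropping P5 here means self.layer_blocks[-1] still runs, but its
            # output never reaches the loss, so its parameters get no gradient.
            results.pop()
        return tuple(results)


if __name__ == "__main__":
    fpn = TinyFPN(use_P5=False)
    feats = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]
    loss = sum(o.sum() for o in fpn(feats))
    loss.backward()
    # The P5 conv's parameters received no gradient -> prints [True, True].
    print([p.grad is None for p in fpn.layer_blocks[-1].parameters()])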
Issue Analytics
- Created 5 years ago
- Comments: 34 (14 by maintainers)
Read more >Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I also met the same problem, and the solution is to remove the unused parameters. Try adding the following line in the code. After backward, if a parameter does not contain grad, it means the parameter is either frozen or not used in the forward pass.

Hi, I have met the same problem with distributed training; the same error that @txytju and @HOPEver1991 referred to is raised. I tried to use one node with multiple GPUs but it failed to backward. I will try to provide a minimal reproducible example, but since I have changed a lot in this repository, it will take some time. I would appreciate it if you could provide some suggestions! Thanks!
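As a hedged sketch of the diagnostic described in the first comment above, one could define a small helper and call it right after losses.backward() in maskrcnn_benchmark's do_train loop (the helper name is hypothetical):

def report_unused_parameters(model):
    """List trainable parameters whose .grad is still None after a backward pass.

    These parameters are either frozen or never used to compute the loss;
    the legacy DistributedDataParallel reducer shown in the traceback above
    fails on exactly these.
    """
    unused = [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]
    print("parameters with no gradient:", unused)
    return unused

Running this once on a single GPU (where training works) identifies the culprits. Newer PyTorch releases (1.1 and later) also add a find_unused_parameters=True argument to torch.nn.parallel.DistributedDataParallel that tolerates such parameters; the version shown in the traceback predates it, so removing the unused parameters (for example, not building the P5 branch at all when use_P5 is False) is the practical fix here.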