Is torch.distributed.all_reduce working as expected?
See original GitHub issue

This line https://github.com/facebookresearch/barlowtwins/blob/main/main.py#L208 uses torch.distributed.all_reduce to sum the correlation matrices across all GPUs. However, as far as I know, this op is not designed for a forward computation that will later be backpropagated: gradients do not flow back through the communication. To apply a correctly differentiable distributed all-reduce, the official PyTorch documentation instead recommends the autograd-enabled primitives in torch.distributed.nn.*: https://pytorch.org/docs/stable/distributed.html#autograd-enabled-communication-primitives
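For concreteness, a minimal sketch of the two variants (the function name, shapes, and variable names are hypothetical, and it assumes an already-initialized process group):

```python
import torch
import torch.distributed as dist
import torch.distributed.nn as dist_nn  # autograd-enabled collectives


def reduce_correlation(z1, z2, differentiable=False):
    """Sum the per-GPU cross-correlation matrices across all processes.

    z1, z2: hypothetical (N, D) embedding batches on the local rank.
    Assumes dist.init_process_group() has already been called.
    """
    c = z1.T @ z2  # local (D, D) correlation matrix

    if differentiable:
        # torch.distributed.nn.all_reduce returns a new tensor and also
        # communicates gradients across ranks in the backward pass.
        c = dist_nn.all_reduce(c)
    else:
        # torch.distributed.all_reduce sums in place; autograd does not
        # track the communication, so each rank backpropagates only
        # through its own local contribution to the sum.
        dist.all_reduce(c)
    return c
```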
Issue Analytics
- State:
- Created 2 years ago
- Reactions: 1
- Comments: 8 (3 by maintainers)
Top Results From Across the Web
- Distributed communication package - torch.distributed - PyTorch: The torch.distributed package provides PyTorch support and communication primitives for multiprocess parallelism across several computation nodes running on ...
- Distributed.all_reduce bandwidth expectations: I want to benchmark how quickly PyTorch with the Gloo backend is able to all-reduce/all-gather a model synchronously.
- Too much time spent in `ncclKernel AllReduce`? - distributed: When distributing the training, is it expected that half of GPU time is spent on ncclKernel_AllReduce_RING_LL_Sum_float? Below are more details ...
- PyTorch Distributed Overview: Use torch.distributed.elastic to launch distributed training if errors (e.g., out-of-memory) are expected or if resources can join and leave dynamically ...
- Distributed.all_reduce returns strange results - PyTorch Forums: Is the above error expected? How did you handle this? If this is handled by skipping/redoing that iteration, it might cause allreduce mismatch ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
After doing some tests with a differentiable all-gather, I realize that your implementation is an equivalent version. Very smart tricks.
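For context, a differentiable all-gather is typically written as a custom autograd.Function along these lines (a common pattern in contrastive-learning codebases, not the implementation referred to above; GatherLayer is a hypothetical name):

```python
import torch
import torch.distributed as dist


class GatherLayer(torch.autograd.Function):
    """All-gather that lets gradients flow back to the local input."""

    @staticmethod
    def forward(ctx, x):
        gathered = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, x)
        return tuple(gathered)

    @staticmethod
    def backward(ctx, *grads):
        # Each rank only sees the gradients produced by its own loss, so
        # sum them across ranks and keep the slice that belongs to the
        # local input.
        all_grads = torch.stack(grads)
        dist.all_reduce(all_grads)
        return all_grads[dist.get_rank()]
```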
I can confirm that torch.distributed.nn.all_reduce is mathematically incorrect: https://github.com/pytorch/pytorch/issues/58005. torch.distributed.all_reduce is correct, but it seems to be so by accident rather than by design.
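A sketch of why the non-differentiable collective still ends up with the right update (this is an inference, not spelled out in the thread): every rank sees the same reduced matrix C = Σ_r C_r, evaluates the same loss L(C), and DDP subsequently averages the parameter gradients across ranks.

```latex
% Desired gradient of the single global loss L(C), with C = \sum_r C_r:
\frac{\partial L}{\partial \theta}
  = \sum_r \frac{\partial L}{\partial C}\,\frac{\partial C_r}{\partial \theta}.
% With the plain all_reduce, rank r backpropagates only through its local
% C_r, so DDP's sum over ranks reproduces exactly the same quantity
% (up to the usual 1/world_size averaging factor):
\sum_r
  \underbrace{\frac{\partial L}{\partial C}\,\frac{\partial C_r}{\partial \theta}}
            _{\text{gradient computed on rank } r}
  = \frac{\partial L}{\partial \theta}.
% The autograd-enabled all_reduce additionally all-reduces
% \partial L / \partial C in the backward pass; since that gradient is
% identical on every rank, this multiplies each per-rank gradient by the
% world size, which under the same assumptions scales the final update.
```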