`deepspeed.comm.barrier()` has different signatures/behaviour from `torch.distributed.barrier()`
It seems `deepspeed.comm.barrier()` has different signatures/behaviour from `torch.distributed.barrier()`. Is this intended?
Reference for `torch.distributed.barrier()`: https://github.com/pytorch/pytorch/blob/07dd2fe6c32948e5ca0a2871e5eb31602a9684cf/torch/distributed/distributed_c10d.py#L3182
Also, is `monitored_barrier()` supported in `deepspeed.comm`? Thanks!
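For context, `torch.distributed.monitored_barrier()` differs from a plain `barrier()` in that it takes a timeout and, if some rank does not arrive in time, reports which rank(s) failed to pass the barrier instead of hanging indefinitely (the PyTorch docs describe it as supported only with the GLOO backend). The sketch below is a hypothetical, thread-level analogue of that idea using only the Python standard library. `MonitoredBarrier` and its `wait(rank, timeout)` method are illustrative names, not part of DeepSpeed or PyTorch, and the sketch says nothing about how either library implements the feature.

```python
import threading
from typing import Iterable, Set


class MonitoredBarrier:
    """Hypothetical single-use barrier that names the participants that never arrived.

    Illustrative only: this is a thread-level analogue of the monitored-barrier
    idea, not DeepSpeed's or PyTorch's implementation.
    """

    def __init__(self, ranks: Iterable[int]) -> None:
        self._expected: Set[int] = set(ranks)  # every rank that must show up
        self._arrived: Set[int] = set()        # ranks that have called wait()
        self._cond = threading.Condition()

    def wait(self, rank: int, timeout: float) -> None:
        """Mark `rank` as arrived, then block until all ranks arrive or `timeout` expires."""
        with self._cond:
            self._arrived.add(rank)
            self._cond.notify_all()  # wake any participants already waiting
            # wait_for returns the predicate's final value: False means we timed out.
            all_here = self._cond.wait_for(
                lambda: self._expected <= self._arrived, timeout=timeout
            )
            if not all_here:
                missing = sorted(self._expected - self._arrived)
                raise TimeoutError(f"ranks {missing} did not reach the barrier")
```

Usage: with `MonitoredBarrier([0, 1])`, a thread calling `wait(1, 5.0)` and the main thread calling `wait(0, 5.0)` both return; if only rank 0 ever calls `wait`, it raises `TimeoutError` naming the missing ranks.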
Issue Analytics
- Created 10 months ago
- Comments:5 (4 by maintainers)
Top GitHub Comments

@jeffra – Sure! I can also add `monitored_barrier()` support for you, @HeyangQin

@Quentin-Anthony can you take a look at this? I think the `barrier` function signature needs to support `barrier(group=GroupMember.WORLD, async_op=False, device_ids=None)`.
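One way to offer the torch-style `barrier(group=..., async_op=..., device_ids=...)` signature on top of a barrier that accepts fewer arguments is a thin adapter that forwards only the keywords the underlying function understands. This is a minimal sketch under stated assumptions, not DeepSpeed's actual implementation: `make_torch_style_barrier` and `legacy_barrier` are hypothetical names, and `None` stands in for torch's `GroupMember.WORLD` default.

```python
import inspect
from typing import Any, Callable, Optional, Sequence


def make_torch_style_barrier(backend_barrier: Callable[..., Any]) -> Callable[..., Any]:
    """Hypothetical adapter exposing torch-style barrier keywords over `backend_barrier`."""
    # Names of the parameters the wrapped barrier actually declares.
    accepted = set(inspect.signature(backend_barrier).parameters)
    # Assumption: None stands in for torch's GroupMember.WORLD default.
    defaults = {"group": None, "async_op": False, "device_ids": None}

    def barrier(group: Any = None, async_op: bool = False,
                device_ids: Optional[Sequence[int]] = None) -> Any:
        requested = {"group": group, "async_op": async_op, "device_ids": device_ids}
        kwargs = {}
        for name, value in requested.items():
            if name in accepted:
                kwargs[name] = value  # backend understands it: forward as-is
            elif value != defaults[name]:
                # Refuse rather than silently ignore a non-default option.
                raise TypeError(f"backend barrier does not support {name}={value!r}")
        return backend_barrier(**kwargs)

    return barrier


def legacy_barrier(group=None):
    """Hypothetical backend barrier that only knows about `group`."""
    return ("synced", group)
```

Usage: `make_torch_style_barrier(legacy_barrier)()` forwards only `group`, while passing `async_op=True` raises `TypeError`. Rejecting non-default values for unsupported keywords, rather than dropping them, keeps a caller from believing an option such as `async_op=True` took effect.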