Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

`deepspeed.comm.barrier()` has different signatures/behaviour from `torch.distributed.barrier()`

See original GitHub issue

https://github.com/microsoft/DeepSpeed/blob/4abf637f96cd36d450125a38a15d21df1cf0b8db/deepspeed/comm/comm.py#L456-L458

It seems `deepspeed.comm.barrier()` has a different signature/behaviour from `torch.distributed.barrier()`. Is this intended?

reference: torch.distributed.barrier() https://github.com/pytorch/pytorch/blob/07dd2fe6c32948e5ca0a2871e5eb31602a9684cf/torch/distributed/distributed_c10d.py#L3182

Also, is monitored_barrier() supported in deepspeed.comm? Thanks!
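
For context, here is a hedged sketch of the mismatch. The torch-side keywords come from the linked distributed_c10d.py; the DeepSpeed wrapper at the linked commit appears to accept none of them, and `compat_barrier` below is a hypothetical workaround, not a DeepSpeed API:

```python
# Hedged sketch: torch.distributed.barrier() accepts group, async_op and
# device_ids (per the linked distributed_c10d.py), while
# deepspeed.comm.barrier() at the linked comm.py lines appears not to.
# `compat_barrier` is a hypothetical shim, not part of DeepSpeed.
import deepspeed.comm as ds_dist

def compat_barrier(group=None, async_op=False, device_ids=None):
    """Expose torch.distributed.barrier()'s keyword surface on top of
    deepspeed.comm.barrier(), failing loudly rather than silently
    dropping arguments the wrapper does not understand."""
    if group is not None or async_op or device_ids is not None:
        raise NotImplementedError(
            "group/async_op/device_ids are not forwarded by "
            "deepspeed.comm.barrier() at the linked commit"
        )
    return ds_dist.barrier()
```

On the second question: `torch.distributed.monitored_barrier()` takes `group`, `timeout` and `wait_all_ranks`, and is only implemented for the gloo backend, so a `deepspeed.comm` equivalent would presumably carry the same restriction.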

Issue Analytics

  • State: closed
  • Created: 10 months ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

2 reactions
Quentin-Anthony commented, Dec 2, 2022

@jeffra – Sure!

I can also add monitored_barrier() support for you, @HeyangQin

1 reaction
jeffra commented, Dec 2, 2022

@Quentin-Anthony can you take a look at this? I think the barrier function signature needs to support barrier(group=GroupMember.WORLD, async_op=False, device_ids=None)
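
A minimal sketch of what that torch-compatible signature could look like, reusing the `cdb` backend-object naming from the linked comm.py (illustrative only, not the merged DeepSpeed patch):

```python
# Sketch of a torch-compatible deepspeed.comm.barrier(); `cdb` mirrors the
# communication-backend object in the linked comm.py, and the assert text
# is illustrative rather than DeepSpeed's actual error message.
from torch.distributed.distributed_c10d import GroupMember

def barrier(group=GroupMember.WORLD, async_op=False, device_ids=None):
    global cdb
    assert cdb is not None, 'DeepSpeed communication backend is not initialized'
    # Forward every keyword so code written against torch.distributed can
    # switch to deepspeed.comm without touching its call sites.
    return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
```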

Read more comments on GitHub >

Top Results From Across the Web

Source code for deepspeed.comm.comm
synchronize()
# If we're using MPI, we can't simply sync the stream
if cdb.using_mpi:
    cdb.barrier()
if ('prof' ...
Read more >
`torch.distributed.barrier` used in multi-node ... - PyTorch Forums
Hello, I was trying to improve one of my multi-node distributed training ... 0: torch.distributed.barrier() # Create directories outside the ...
Read more >
Slow processing with map when using deepspeed or fairscale
In a distributed setting, you may use caching and a torch.distributed.barrier() to make sure that only the main process performs the mapping ...
Read more >
How does torch.distributed.barrier() work - Stack Overflow
They wait there, because barrier() blocks until all processes have reached a barrier, but the base process has not reached a barrier yet....
Read more >
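
The PyTorch Forums and Hugging Face results above both describe the same idiom: rank 0 performs some one-time setup while every other rank blocks at a barrier, exactly as the Stack Overflow answer explains. A minimal illustration with plain torch.distributed, assuming the process group has already been initialized:

```python
import os
import torch.distributed as dist

def main_process_first(fn):
    """Run one-time setup on rank 0 while the other ranks wait.

    Non-zero ranks reach the barrier immediately and block there; rank 0
    joins only after `fn` finishes, at which point all ranks are released.
    """
    if dist.get_rank() == 0:
        fn()  # e.g. create output directories, cache a mapped dataset
    dist.barrier()

# Example usage, after dist.init_process_group(...) has run on every rank:
# main_process_first(lambda: os.makedirs("checkpoints", exist_ok=True))
```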
