
How does the `synchronize` function work?

See original GitHub issue

❓ Questions and Help

I found that in your code you use a `synchronize` helper function to synchronize GPUs. I am not familiar with the usage of `torch.distributed.deprecated`, and I am trying to understand how the following code actually works.

import time

import torch


def synchronize():
    """
    Helper function to synchronize between multiple processes when
    using distributed training
    """
    if not torch.distributed.deprecated.is_initialized():
        return
    world_size = torch.distributed.deprecated.get_world_size()
    rank = torch.distributed.deprecated.get_rank()
    if world_size == 1:
        return

    def _send_and_wait(r):
        # Rank r contributes a 0; every other rank starts with a 1 that
        # the broadcast from rank r overwrites.
        if rank == r:
            tensor = torch.tensor(0, device="cuda")
        else:
            tensor = torch.tensor(1, device="cuda")
        torch.distributed.deprecated.broadcast(tensor, r)
        # Spin until the broadcast value has landed, i.e. until rank r
        # has reached this point.
        while tensor.item() == 1:
            time.sleep(1)

    _send_and_wait(0)
    # now sync on the main process
    _send_and_wait(1)

I have some questions here:

  1. Will `broadcast` block the main process until all other processes receive the tensor? What is the behavior of this function in the different types of processes?
  2. Why call `_send_and_wait(1)` at the end of this function? I know that rank 0 is the master, but what is special about rank 1? (See the sketch after this list.)
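
As context for both questions: the function above is a hand-rolled barrier built from two broadcasts. `broadcast` is a collective, so each rank blocks until its part of the transfer completes; in the non-deprecated `torch.distributed` API the same overall effect is usually obtained with `barrier()`, a collective on which every rank blocks until all ranks have entered the call. A minimal sketch of the equivalent, assuming an already-initialized process group (this is not the code from the issue):

import torch.distributed as dist


def synchronize():
    """Block every process until all processes reach this point."""
    if not dist.is_available() or not dist.is_initialized():
        return
    if dist.get_world_size() == 1:
        return
    # A collective barrier: each rank blocks here until all ranks
    # arrive, which is what the broadcast/spin loop above emulates.
    dist.barrier()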

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 29 (27 by maintainers)

Top GitHub Comments

3 reactions
yelantf commented, Jan 25, 2019

Oh, yes, I didn’t use COCO, and the error did not always happen. It happens more frequently when I run multiple copies of the inference code on a single machine at the same time. All 8 GPUs are the same. I use GPUs 0-3 to run one copy of the code and GPUs 4-7 to run another. I think the cause may be the high CPU workload.
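
For anyone reproducing this two-runs-per-machine setup: pinning each copy to a GPU subset is normally done by restricting device visibility before CUDA is initialized. A minimal sketch, not from the original thread (the exact device list is per-run):

import os

# Must be set before the first CUDA call; before importing torch is
# the safest place. Use "0,1,2,3" for one copy of the inference code
# and "4,5,6,7" for the other.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")

import torch

# This process now sees only four devices, re-indexed cuda:0..cuda:3.
print(torch.cuda.device_count())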

3 reactions
pietern commented, Jan 24, 2019

Yes, this happens because of the lazy initialization in the NCCL backend. The faster process tries to create a new NCCL communicator and waits for the slower process to do the same. This times out after 5 minutes. The timeout is set on the k/v store (be it a file-backed store or a TCP store where a single process acts as the server) and is currently not configurable.

This is a dup of pytorch/pytorch#16225, so this one can be closed and we can continue the discussion there.
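
A follow-up note, not from the thread itself: in later PyTorch releases this timeout became configurable through the `timeout` argument of `init_process_group`. A minimal sketch with the current (non-deprecated) API, assuming rank and world size come from the usual launcher environment variables:

from datetime import timedelta

import torch.distributed as dist

# Give a slow rank more than the old hard-coded 5 minutes to join
# NCCL communicator initialization before the other ranks give up.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))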
