
How to all_gather Tensor if not the same length

See original GitHub issue

❓ Questions/Help/Support

Hi @vfdev-5 ,

I am developing a distributed evaluation feature and facing a problem: the preds and labels on different GPUs don't have the same length, so ignite.idist.all_gather() can't work. For example: GPU0 has 5 images to handle and GPU1 has 4 images, 9 images in total. Could you please help with how to idist.all_gather() these values? I don't want to pad the input data to make it evenly divisible, because that would make the metrics differ between single-GPU and multi-GPU runs.

Thanks in advance.
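
For concreteness, here is a minimal sketch of the setup described above, with hypothetical shapes and a two-process gloo group started via idist.spawn; the direct idist.all_gather call is the one reported as not working, since the per-rank tensors differ in their first dimension:

import torch
import ignite.distributed as idist


def _worker(local_rank):
    rank = idist.get_rank()
    # GPU0 handles 5 images and GPU1 handles 4, so the prediction tensors differ in length
    preds = torch.randn(5 if rank == 0 else 4, 10)
    # this is the direct gather the question reports as not working,
    # because the tensors being gathered do not share the same shape on every process
    gathered = idist.all_gather(preds)


if __name__ == "__main__":
    idist.spawn("gloo", _worker, args=(), nproc_per_node=2)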

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

2 reactions
Nic-Ma commented, Jan 26, 2021

Hi @vfdev-5 ,

Your example code looks good, and I have now developed an evenly_divisible_all_gather() in MONAI to handle this case:

import torch
import ignite.distributed as idist


def evenly_divisible_all_gather(data: torch.Tensor):
    """
    Utility function for distributed data parallel to pad a tensor so that it is evenly divisible for all_gather.
    Args:
        data: source tensor to pad and run all_gather on in distributed data parallel.

    """
    if idist.get_world_size() <= 1:
        return data
    # make sure the data is evenly divisible across all GPUs
    length = data.shape[0]
    all_lens = idist.all_gather(length)
    max_len = max(all_lens).item()
    if length < max_len:
        size = [max_len - length] + list(data.shape[1:])
        data = torch.cat([data, data.new_full(size, float("NaN"))], dim=0)
    # all_gather across all processes
    data = idist.all_gather(data)
    # strip the NaN padding: keep only the first `l` rows of each rank's block
    return torch.cat([data[i * max_len : i * max_len + l, ...] for i, l in enumerate(all_lens)], dim=0)

Thanks.
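
For illustration, a minimal usage sketch of the helper above, assuming it is defined in the same module and a two-process gloo group is launched with idist.spawn; the uneven lengths mirror the 5-image/4-image example from the question:

import torch
import ignite.distributed as idist


def _evaluate(local_rank):
    rank = idist.get_rank()
    # uneven workload: rank 0 holds predictions for 5 images, rank 1 for 4
    preds = torch.arange(5 if rank == 0 else 4, dtype=torch.float32)
    gathered = evenly_divisible_all_gather(preds)
    # every rank now sees all 9 predictions, with the NaN padding removed
    print(f"rank {rank}: gathered {gathered.shape[0]} predictions")


if __name__ == "__main__":
    idist.spawn("gloo", _evaluate, args=(), nproc_per_node=2)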

1 reaction
Nic-Ma commented, Jan 25, 2021

I compared your program with mine and found the root cause; sorry, it was my mistake, and all_gather works for my string now. Thanks very much for your help and the example program!

Read more comments on GitHub >

Top Results From Across the Web

Gather/Concatenate tensor arrays of different lengths/sizes
Use dist.all_gather to get sizes of all arrays. Find the max size. Pad local array to max size using zeros/constants. Use ...
horovod.torch.mpi_ops
If name is not provided, an incremented auto-generated name is used. The tensor type and shape must be the same on all Horovod...
Distributed communication package - torch.distributed - PyTorch
This function requires that all processes in the main group (i.e. all processes that are part of the distributed job) enter this function,...
Operation Semantics | XLA - TensorFlow
If a replica id is not a target in any pair, then the output on that replica is a tensor consists of 0(s)...
Parallel computing - Pytorch distributed - Google Sites
Pytorch has several supports for distributed version which is similar to MPI. ... def run(rank, size): tensor = torch.zeros(1) if rank == 0:...
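
The first result above outlines the same recipe with raw torch.distributed: gather the sizes, pad to the maximum, gather, then unpad. A minimal sketch of that recipe, assuming the default process group is already initialized and every rank's tensor shares the same trailing dimensions (gather_uneven is a hypothetical helper name):

import torch
import torch.distributed as dist


def gather_uneven(data: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    # 1) share every rank's length
    local_len = torch.tensor([data.shape[0]], device=data.device)
    all_lens = [torch.zeros_like(local_len) for _ in range(world_size)]
    dist.all_gather(all_lens, local_len)
    max_len = max(int(l) for l in all_lens)
    # 2) pad the local tensor up to the maximum length with zeros
    if data.shape[0] < max_len:
        pad = data.new_zeros([max_len - data.shape[0]] + list(data.shape[1:]))
        data = torch.cat([data, pad], dim=0)
    # 3) gather the padded tensors, then drop the padding again
    gathered = [torch.zeros_like(data) for _ in range(world_size)]
    dist.all_gather(gathered, data)
    return torch.cat([g[: int(l)] for g, l in zip(gathered, all_lens)], dim=0)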
