
How to all_gather Tensor if not the same length

See original GitHub issue

❓ Questions/Help/Support

Hi @vfdev-5 ,

I am developing a distributed evaluation feature and facing a problem: the preds and labels on different GPUs don't have the same length, so ignite.idist.all_gather() can't work. For example: GPU0 has 5 images to handle and GPU1 has 4 images, 9 images in total. Could you please help with how to idist.all_gather() these values? I don't want to pad the input data to make it evenly divisible, because that would make the metrics differ between single-GPU and multi-GPU runs.

Thanks in advance.
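
For concreteness, here is a minimal sketch of the setup described above, with hypothetical shapes and a two-process gloo group started via idist.spawn; the direct idist.all_gather call is the one reported as not working, since the per-rank tensors differ in their first dimension:

import torch
import ignite.distributed as idist


def _worker(local_rank):
    rank = idist.get_rank()
    # GPU0 handles 5 images and GPU1 handles 4, so the prediction tensors differ in length
    preds = torch.randn(5 if rank == 0 else 4, 10)
    # this is the direct gather the question reports as not working,
    # because the tensors being gathered do not share the same shape on every process
    gathered = idist.all_gather(preds)


if __name__ == "__main__":
    idist.spawn("gloo", _worker, args=(), nproc_per_node=2)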

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

2 reactions
Nic-Ma commented, Jan 26, 2021

Hi @vfdev-5 ,

Your example code looks good, and I have now developed an evenly_divisible_all_gather() in MONAI to handle this case:

import torch
import ignite.distributed as idist


def evenly_divisible_all_gather(data: torch.Tensor):
    """
    Utility function for distributed data parallel to pad a tensor so that it is evenly divisible for all_gather.
    Args:
        data: source tensor to pad and run all_gather on in distributed data parallel.

    """
    if idist.get_world_size() <= 1:
        return data
    # make sure the data is evenly divisible across all GPUs
    length = data.shape[0]
    all_lens = idist.all_gather(length)
    max_len = max(all_lens).item()
    if length < max_len:
        size = [max_len - length] + list(data.shape[1:])
        data = torch.cat([data, data.new_full(size, float("NaN"))], dim=0)
    # all_gather across all processes
    data = idist.all_gather(data)
    # strip the NaN padding: keep only the first `l` rows of each rank's block
    return torch.cat([data[i * max_len : i * max_len + l, ...] for i, l in enumerate(all_lens)], dim=0)

Thanks.
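
For illustration, a minimal usage sketch of the helper above, assuming it is defined in the same module and a two-process gloo group is launched with idist.spawn; the uneven lengths mirror the 5-image/4-image example from the question:

import torch
import ignite.distributed as idist


def _evaluate(local_rank):
    rank = idist.get_rank()
    # uneven workload: rank 0 holds predictions for 5 images, rank 1 for 4
    preds = torch.arange(5 if rank == 0 else 4, dtype=torch.float32)
    gathered = evenly_divisible_all_gather(preds)
    # every rank now sees all 9 predictions, with the NaN padding removed
    print(f"rank {rank}: gathered {gathered.shape[0]} predictions")


if __name__ == "__main__":
    idist.spawn("gloo", _evaluate, args=(), nproc_per_node=2)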

1 reaction
Nic-Ma commented, Jan 25, 2021

I compared your program with mine and found the root cause; sorry, it was my mistake, and all_gather works for my string now. Thanks very much for your help and the example program!

Read more comments on GitHub >

Top Results From Across the Web

Gather/Concatenate tensor arrays of different lengths/sizes
Use dist.all_gather to get sizes of all arrays. Find the max size. Pad local array to max size using zeros/constants. Use ...
horovod.torch.mpi_ops
If name is not provided, an incremented auto-generated name is used. The tensor type and shape must be the same on all Horovod...
Distributed communication package - torch.distributed - PyTorch
This function requires that all processes in the main group (i.e. all processes that are part of the distributed job) enter this function,...
Operation Semantics | XLA - TensorFlow
If a replica id is not a target in any pair, then the output on that replica is a tensor consists of 0(s)...
Parallel computing - Pytorch distributed - Google Sites
Pytorch has several supports for distributed version which is similar to MPI. ... def run(rank, size): tensor = torch.zeros(1) if rank == 0:...
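
The first result above outlines the same recipe with raw torch.distributed: gather the sizes, pad to the maximum, gather, then unpad. A minimal sketch of that recipe, assuming the default process group is already initialized and every rank's tensor shares the same trailing dimensions (gather_uneven is a hypothetical helper name):

import torch
import torch.distributed as dist


def gather_uneven(data: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    # 1) share every rank's length
    local_len = torch.tensor([data.shape[0]], device=data.device)
    all_lens = [torch.zeros_like(local_len) for _ in range(world_size)]
    dist.all_gather(all_lens, local_len)
    max_len = max(int(l) for l in all_lens)
    # 2) pad the local tensor up to the maximum length with zeros
    if data.shape[0] < max_len:
        pad = data.new_zeros([max_len - data.shape[0]] + list(data.shape[1:]))
        data = torch.cat([data, pad], dim=0)
    # 3) gather the padded tensors, then drop the padding again
    gathered = [torch.zeros_like(data) for _ in range(world_size)]
    dist.all_gather(gathered, data)
    return torch.cat([g[: int(l)] for g, l in zip(gathered, all_lens)], dim=0)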
