Correlation is always zero when using multiple GPUs
See original GitHub issue.
The following code snippet reproduces the bug:
import torch
from spatial_correlation_sampler import spatial_correlation_sample

def run_spatial_corr(rank):
    # Correlate two identical all-ones feature maps on the given GPU.
    corr = spatial_correlation_sample(torch.ones(1, 512, 12, 27).to(f"cuda:{rank}"),
                                      torch.ones(1, 512, 12, 27).to(f"cuda:{rank}")).mean()
    print(corr)

run_spatial_corr(0)
run_spatial_corr(1)
The expected output is:
tensor(512., device='cuda:0')
tensor(512., device='cuda:1')
However, it returns:
tensor(512., device='cuda:0')
tensor(0., device='cuda:1')
The output is as expected if both calls use the same device ordinal or if everything is executed on the CPU. I ran the code with Python 3.7 and PyTorch 1.2.
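A possible workaround (a sketch based on the assumption that the custom CUDA kernel launches on whatever device is currently selected, rather than on the device of its inputs): explicitly select the target GPU before calling the sampler. The function name below is hypothetical.

import torch
from spatial_correlation_sampler import spatial_correlation_sample

def run_spatial_corr_guarded(rank):
    # Make cuda:{rank} the current device so the kernel launches on the
    # same GPU that holds the input tensors.
    with torch.cuda.device(rank):
        ones = torch.ones(1, 512, 12, 27, device=f"cuda:{rank}")
        print(spatial_correlation_sample(ones, ones).mean())

run_spatial_corr_guarded(1)  # expected: tensor(512., device='cuda:1')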
Issue Analytics
- Created: 4 years ago
- Comments: 15 (14 by maintainers)
Top Results From Across the Web
- Why do we average out correlation matrices from different ...: The cross correlations between samples on different GPUs are not computed. It is not clear to me why the different cross correlation matrices ...
- Does zero correlation always imply that the two variables are ...: If we find that two variables are not correlated (correlation coefficient is very weak or exactly 0) in a large population, then ...
- Accelerating Correlation with GPUs - MathWorks: This example shows how to use a GPU to accelerate cross-correlation. Many correlation problems involve large data sets and can be solved much ...
- A GPU Implementation of the Correlation Technique for Real ...: The correlation technique is applied as a convolution with multiple finite impulse response (FIR) filters in the Fourier domain.
- CUDA-Zero: a framework for porting shared memory GPU ...: The size of the matrix cannot exceed 16k if the GPU device has 6 Gbytes of device memory. It's necessary to scale to multiple ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
There is indeed something fishy here. We should clarify this on the PyTorch repo. Maybe we'll see what they have to say regarding issues with tutorials such as the one you linked, or https://github.com/pytorch/tutorials/issues/1196
I have a fundamental doubt: should custom kernel authors take care of setting the device guard themselves (in which case we can add it to the PyTorch tutorial), or should PyTorch handle it internally in some way, or at least provide a user-friendly API to set the device?
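For context, a minimal sketch of why a guard matters, under the assumption that the extension launches its kernel on the currently selected CUDA device: creating a tensor on cuda:1 does not change the current device, so a launcher without a guard still targets cuda:0 and never touches the buffers on cuda:1.

import torch

x = torch.ones(1, device="cuda:1")
print(torch.cuda.current_device())      # 0: the current device did not follow the tensor

# The Python-level equivalent of a device guard: switch the current device
# for the duration of the block, then restore it afterwards (roughly what a
# C++ extension would do with at::cuda::CUDAGuard).
with torch.cuda.device(x.device):
    print(torch.cuda.current_device())  # 1
print(torch.cuda.current_device())      # back to 0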