torch.randperm() on CUDA returns wrong values when n (int) is large (n > 2^12)
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- I followed the TorchVision Object Detection Finetuning Tutorial, using the same code as in the tutorial.
- On the first run I trained on the CPU, which worked, but the evaluation step still used the GPU; that is the first problem.
- I then trained on the GPU with CUDA (I only have one GPU), and training failed with RuntimeError: CUDA error: device-side assert triggered. I am using PyTorch 1.8.1 and torchvision 0.9.1.
- I debugged the whole pipeline and found that some images work while others do not. The failure occurs during the loss computation, when the positive and negative samples are subsampled. BalancedPositiveNegativeSampler() in _utils.py uses
torch.randperm(positive.numel(), device=positive.device)[:num_pos]
to generate a random index.
- However, the function returns wrong values that are far too large, such as 4755801207605297152, even though positive.numel() is only 265826. I tried different values of n and found that, on my machine, torch.randperm fails to return a valid index list whenever n > 2^12. I suspect the limit is related to the GPU or its configuration; a minimal check is sketched after this list.
- I think the code that generates the random index should include a check: if n exceeds the limit, it should fall back to the CPU.
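As a minimal sketch of the check described above (my own illustration, not code from the tutorial or torchvision), the following compares CPU and CUDA results of torch.randperm for a few values of n, including the reported 265826; it assumes a CUDA-capable PyTorch build.
import torch

def randperm_is_valid(n, device):
    # A valid permutation of 0..n-1 must sort to exactly arange(n).
    perm = torch.randperm(n, device=device)
    return torch.equal(perm.sort().values, torch.arange(n, device=device))

if torch.cuda.is_available():
    for n in (2 ** 12, 2 ** 12 + 1, 265826):
        print(n, 'cpu ok:', randperm_is_valid(n, 'cpu'),
              'cuda ok:', randperm_is_valid(n, 'cuda'))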
Expected behavior
Returns a random permutation of integers from 0 to n - 1
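As a quick illustration of this contract (my own example, not from the issue): every integer in [0, n) should appear exactly once, regardless of device.
import torch

perm = torch.randperm(5)  # e.g. tensor([2, 0, 4, 1, 3]); the order is random
# Sorting the result must recover 0..n-1 exactly.
assert perm.sort().values.equal(torch.arange(5))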
Environment
Please see the attached wrong message.txt for the output of the environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
- PyTorch / torchvision Version (e.g., 1.0 / 0.4.0):
- OS (e.g., Linux):
- How you installed PyTorch / torchvision (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
Additional context
Hi @nothingwithyou,
I would recommend opening a new ticket on PyTorch and providing a minimal set of commands to reproduce it. From what you describe, something like
torch.randperm(265826, device='cuda').max()
should be enough to showcase any potential issue. Unfortunately, when I run the above command, I don't get any values larger than n. See below:
I would also advise trying the latest PyTorch nightly to see whether the problem is resolved.
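For reference, a minimal reproduction along these lines might look like the sketch below; the version prints and the value of n are assumptions based on the numbers quoted in this thread.
import torch

# Report the build, then check the maximum of a CUDA permutation against n - 1.
n = 265826
print(torch.__version__, torch.version.cuda)
perm = torch.randperm(n, device='cuda')
print('max =', perm.max().item(), '(expected at most', n - 1, ')')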
I also encountered the issue where torch.randperm on device cuda:0 returned very large values outside the expected bounds, whereas torch.randperm on the CPU worked correctly. As a workaround, I generated the permutation on the CPU and then moved it to the GPU using
torch.randperm(n).to(device)
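Applied to the sampler code quoted in the report, that workaround would look roughly like the sketch below (an illustration, not the actual torchvision patch; the helper name is hypothetical):
import torch

def sample_positive_indices(positive, num_pos):
    # Generate the permutation on the CPU, where randperm behaves correctly,
    # then move the selected indices back to the tensor's original device.
    perm = torch.randperm(positive.numel())[:num_pos]
    return perm.to(positive.device)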