
torch.randperm() on CUDA returns wrong values when n (int) is large (n > 2^12)

See original GitHub issue

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Follow the TorchVision Object Detection Finetuning Tutorial, using the same code as the tutorial.
  2. On the CPU, training runs fine, but during evaluation the network still uses the GPU; that is the first problem.
  3. Training on the GPU with CUDA (I have a single GPU) fails with RuntimeError: CUDA error: device-side assert triggered. I am using PyTorch 1.8.1 and torchvision 0.9.1.
  4. Debugging the code shows that some images work and others do not. The failure occurs during loss calculation, when positive and negative samples are subsampled: BalancedPositiveNegativeSampler() in _utils.py uses torch.randperm(positive.numel(), device=positive.device)[:num_pos] to generate random indices.
  5. However, the function returns invalid values, some extremely large (e.g. 4755801207605297152) even though positive.numel() is 265826. Trying different values of n, I find that on my machine torch.randperm fails to return a valid index list whenever n > 2^12; I suspect the limit is related to the GPU or other hardware details. A minimal bounds check is sketched after this list.
  6. The random-index generation should include a guard: if the given n exceeds the limit, it should fall back to generating the permutation on the CPU.
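
A minimal sketch of such a bounds check, assuming a CUDA device is available (the 2^12 threshold and the helper name come from the report above and are not verified):

import torch

def randperm_in_bounds(n, device='cuda'):
    # A correct permutation of 0..n-1 never contains a value >= n.
    perm = torch.randperm(n, device=device)
    return 0 <= perm.min().item() and perm.max().item() < n

# Probe around the threshold reported in step 5.
for n in (2**12, 2**12 + 1, 265826):
    print(n, 'ok' if randperm_in_bounds(n) else 'out-of-range values')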

Attachment: wrong history.txt

Expected behavior

Returns a random permutation of integers from 0 to n - 1
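
A stricter way to verify that property, sketched here for illustration (not part of the original issue):

import torch

n = 265826
perm = torch.randperm(n, device='cuda')
# Sorting a true permutation of 0..n-1 must reproduce arange(n) exactly.
assert torch.equal(perm.sort().values, torch.arange(n, device=perm.device))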

Environment

Please copy and paste the output from the environment collection script (attachment: wrong message.txt)

(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch / torchvision Version (e.g., 1.0 / 0.4.0):
  • OS (e.g., Linux):
  • How you installed PyTorch / torchvision (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

Attachment: wrong envs.txt

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 16 (6 by maintainers)

Top GitHub Comments

1 reaction
datumbox commented, May 12, 2021

Hi @nothingwithyou,

I would recommend opening a new ticket on PyTorch and providing a minimal set of commands to reproduce it. From what you describe, something like torch.randperm(265826, device='cuda').max() should be enough to showcase any potential issue.

Unfortunately when I run the above command, I don’t get any values larger than n. See below:

>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')
>>> torch.randperm(265826, device='cuda').max()
tensor(265825, device='cuda:0')

I would also advise trying the latest PyTorch nightly to see if the problem is resolved.

0 reactions
dtch1997 commented, Dec 9, 2022

I also encountered the issue where torch.randperm on device cuda:0 returned very large values outside the expected bounds, whereas torch.randperm on the CPU worked correctly.

As a workaround, I initialized the randperm on the CPU and then moved it to the GPU using torch.randperm(n).to(device).
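
That workaround can be wrapped in a small helper; a sketch, where the function name and the sampler-style usage are illustrative only:

import torch

def randperm_cpu(n, device):
    # Build the permutation on the CPU, where randperm behaves correctly,
    # then move the result to the target device.
    return torch.randperm(n).to(device)

# Illustrative use in place of the sampler's index generation:
positive = torch.arange(265826, device='cuda')
num_pos = 128
idx = randperm_cpu(positive.numel(), positive.device)[:num_pos]
assert idx.max().item() < positive.numel()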
