torch.randperm() on CUDA returns wrong values when n (int) is large (n > 2^12)
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- I followed the TorchVision Object Detection Finetuning Tutorial, using the same code as in the tutorial.
- On the first run I trained on the CPU, which worked, but the evaluation step still used the GPU; that is the first problem.
- I then trained on the GPU with CUDA (I only have one GPU), and training failed with RuntimeError: CUDA error: device-side assert triggered. I am using PyTorch 1.8.1 and torchvision 0.9.1.
- I debugged the whole pipeline and found that some images work while others do not. The failure occurs during the loss computation, when the positive and negative samples are subsampled. BalancedPositiveNegativeSampler() in _utils.py uses
torch.randperm(positive.numel(), device=positive.device)[:num_pos]
to generate a random index.
- However, the function returns wrong values that are far too large, such as 4755801207605297152, even though positive.numel() is only 265826. I tried different values of n and found that, on my machine, torch.randperm fails to return a valid index list whenever n > 2^12. I suspect the limit is related to the GPU or its configuration; a minimal check is sketched after this list.
- I think the code that generates the random index should include a check: if n exceeds the limit, it should fall back to the CPU.
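As a minimal sketch of the check described above (my own illustration, not code from the tutorial or torchvision), the following compares CPU and CUDA results of torch.randperm for a few values of n, including the reported 265826; it assumes a CUDA-capable PyTorch build.
import torch

def randperm_is_valid(n, device):
    # A valid permutation of 0..n-1 must sort to exactly arange(n).
    perm = torch.randperm(n, device=device)
    return torch.equal(perm.sort().values, torch.arange(n, device=device))

if torch.cuda.is_available():
    for n in (2 ** 12, 2 ** 12 + 1, 265826):
        print(n, 'cpu ok:', randperm_is_valid(n, 'cpu'),
              'cuda ok:', randperm_is_valid(n, 'cuda'))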
Expected behavior
Returns a random permutation of integers from 0 to n - 1
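As a quick illustration of this contract (my own example, not from the issue): every integer in [0, n) should appear exactly once, regardless of device.
import torch

perm = torch.randperm(5)  # e.g. tensor([2, 0, 4, 1, 3]); the order is random
# Sorting the result must recover 0..n-1 exactly.
assert perm.sort().values.equal(torch.arange(5))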
Environment
Please see the attached wrong message.txt for the output of the environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
- PyTorch / torchvision Version (e.g., 1.0 / 0.4.0):
- OS (e.g., Linux):
- How you installed PyTorch / torchvision (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
Additional context
Hi @nothingwithyou,
I would recommend opening a new ticket on PyTorch and providing a minimal set of commands to reproduce it. From what you describe, something like
torch.randperm(265826, device='cuda').max()
should be enough to showcase any potential issue. Unfortunately, when I run the above command, I don't get any values larger than n. See below:
I would also advise trying the latest PyTorch nightly to see whether the problem is resolved.
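For reference, a minimal reproduction along these lines might look like the sketch below; the version prints and the value of n are assumptions based on the numbers quoted in this thread.
import torch

# Report the build, then check the maximum of a CUDA permutation against n - 1.
n = 265826
print(torch.__version__, torch.version.cuda)
perm = torch.randperm(n, device='cuda')
print('max =', perm.max().item(), '(expected at most', n - 1, ')')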
I also encountered the issue where torch.randperm on device cuda:0 returned very large values outside the expected bounds, whereas torch.randperm on the CPU worked correctly. As a workaround, I generated the permutation on the CPU and then moved it to the GPU using
torch.randperm(n).to(device)
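Applied to the sampler code quoted in the report, that workaround would look roughly like the sketch below (an illustration, not the actual torchvision patch; the helper name is hypothetical):
import torch

def sample_positive_indices(positive, num_pos):
    # Generate the permutation on the CPU, where randperm behaves correctly,
    # then move the selected indices back to the tensor's original device.
    perm = torch.randperm(positive.numel())[:num_pos]
    return perm.to(positive.device)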