deform_conv2d, CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
🐛 Bug
Opening a new issue for the bug reported in https://github.com/pytorch/vision/issues/2598#issuecomment-896921180; I could reproduce it on a nightly build as well.
Thanks to @Queuecumber
Versions: torch 1.10.0.dev20210726+cu111, torchvision 0.11.0a0+c51f8c1
```
Total memory used before DFC call: 5.321267579510999%
Traceback (most recent call last):
  File "repro_vision_2598.py", line 58, in <module>
    test_out = dfc(test_in, test_offset)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1060, in _call_impl
    return forward_call(*input, **kwargs)
  File "repro_vision_2598.py", line 40, in forward
    res = deform_conv2d(input=x, offset=offset, weight=self.weight, stride=_pair(self.stride), padding=_pair(self.padding), dilation=_pair(self.dilation), mask=mask)
  File "/vision/torchvision/ops/deform_conv.py", line 89, in deform_conv2d
    return torch.ops.torchvision.deform_conv2d(
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
```
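For context, a minimal sketch of the failing call pattern. The batch size of 23 comes from the analysis in the comments below; the channel and spatial sizes are illustrative assumptions, not the exact values from repro_vision_2598.py:

```python
import torch
from torchvision.ops import deform_conv2d

# Assumed shapes: bs=23 is the trigger discussed below; channel/spatial
# sizes are placeholders. With these values, in_c * k * k * bs * h * w
# exceeds 2**31 elements, matching the int32-overflow analysis below.
bs, in_c, out_c, h, w, k = 23, 64, 64, 512, 512, 3

x = torch.randn(bs, in_c, h, w, device="cuda")
weight = torch.randn(out_c, in_c, k, k, device="cuda")
# 2 coordinates (y, x) per kernel element per output location; with
# stride=1 and padding=1 the output keeps the h x w spatial size.
offset = torch.randn(bs, 2 * k * k, h, w, device="cuda")

# On affected builds this raised CUBLAS_STATUS_ALLOC_FAILED; on cards
# without enough free memory it fails earlier with a plain OOM.
out = deform_conv2d(x, offset, weight, padding=1)
```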
Environment
- PyTorch / torchvision Version (e.g., 1.0 / 0.4.0):
- OS (e.g., Linux):
- How you installed PyTorch / torchvision (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
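As an aside, most of the template above can be filled in with PyTorch's bundled environment collector; the snippet below just wraps the standard `python -m torch.utils.collect_env` entry point:

```python
# Prints the standard PyTorch environment report: torch/CUDA/cuDNN versions,
# OS, GPU models, and relevant installed packages.
from torch.utils.collect_env import get_pretty_env_info

print(get_pretty_env_info())
```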
Additional context
Top GitHub Comments
I still couldn't repro it on a 16 GB card (I was getting honest OOMs), but I think what's happening is that for bs=23, `columns` (https://github.com/pytorch/vision/blob/7d52be76c8eaf02b12338afe0822396ab3547fe2/torchvision/csrc/ops/cuda/deform_conv2d_kernel.cu#L1079-L1080) has more than 2**31 elements (217055232, to be exact), and the im2col kernel uses int32 addressing, so some address computations overflow. The fix would be either to make the kernel templated and use int64 index computation when necessary, or, instead of limiting `n_parallel_imgs` to the constant `kMaxParallelImgs`, to compute `n_parallel_imgs` in such a way that `columns` has fewer than 2**31 elements.
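A minimal Python sketch of that second suggestion, assuming `columns` is allocated with shape (in_c * weight_h * weight_w, n_parallel_imgs * out_h * out_w) as in the linked kernel source. The names mirror the CUDA code, but the function itself is illustrative, not the actual patch:

```python
INT32_MAX = 2**31 - 1
K_MAX_PARALLEL_IMGS = 32  # kMaxParallelImgs in the CUDA kernel


def pick_n_parallel_imgs(batch_sz, in_c, weight_h, weight_w, out_h, out_w):
    """Largest divisor of batch_sz whose `columns` stays int32-addressable."""
    rows = in_c * weight_h * weight_w
    for n in range(min(batch_sz, K_MAX_PARALLEL_IMGS), 0, -1):
        if batch_sz % n == 0 and rows * n * out_h * out_w <= INT32_MAX:
            return n
    # Even a single image may overflow for extreme shapes; int64 indexing
    # (the first suggested fix) would be needed in that case.
    return 1


# For the failing case (bs=23 is prime, so the old code processed all 23
# images at once): pick_n_parallel_imgs(23, 64, 3, 3, 512, 512) -> 1
```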
Seems to be working, thanks a lot for the fix and sorry for my late reply.