deform_conv2d, CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
🐛 Bug
Opening a new issue for the bug reported in https://github.com/pytorch/vision/issues/2598#issuecomment-896921180; I could reproduce it on a nightly build as well.
Thanks to @Queuecumber
Versions: torch 1.10.0.dev20210726+cu111, torchvision 0.11.0a0+c51f8c1
```
Total memory used before DFC call: 5.321267579510999%
Traceback (most recent call last):
  File "repro_vision_2598.py", line 58, in <module>
    test_out = dfc(test_in, test_offset)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1060, in _call_impl
    return forward_call(*input, **kwargs)
  File "repro_vision_2598.py", line 40, in forward
    res = deform_conv2d(input=x, offset=offset, weight=self.weight, stride=_pair(self.stride), padding=_pair(self.padding), dilation=_pair(self.dilation), mask=mask)
  File "/vision/torchvision/ops/deform_conv.py", line 89, in deform_conv2d
    return torch.ops.torchvision.deform_conv2d(
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
```
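For context, a minimal sketch of the failing call pattern. The batch size of 23 comes from the analysis in the comments below; the channel and spatial sizes are illustrative assumptions, not the exact values from repro_vision_2598.py:

```python
import torch
from torchvision.ops import deform_conv2d

# Assumed shapes: bs=23 is the trigger discussed below; channel/spatial
# sizes are placeholders. With these values, in_c * k * k * bs * h * w
# exceeds 2**31 elements, matching the int32-overflow analysis below.
bs, in_c, out_c, h, w, k = 23, 64, 64, 512, 512, 3

x = torch.randn(bs, in_c, h, w, device="cuda")
weight = torch.randn(out_c, in_c, k, k, device="cuda")
# 2 coordinates (y, x) per kernel element per output location; with
# stride=1 and padding=1 the output keeps the h x w spatial size.
offset = torch.randn(bs, 2 * k * k, h, w, device="cuda")

# On affected builds this raised CUBLAS_STATUS_ALLOC_FAILED; on cards
# without enough free memory it fails earlier with a plain OOM.
out = deform_conv2d(x, offset, weight, padding=1)
```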
Environment
- PyTorch / torchvision Version (e.g., 1.0 / 0.4.0):
- OS (e.g., Linux):
- How you installed PyTorch / torchvision (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
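As an aside, most of the template above can be filled in with PyTorch's bundled environment collector; the snippet below just wraps the standard `python -m torch.utils.collect_env` entry point:

```python
# Prints the standard PyTorch environment report: torch/CUDA/cuDNN versions,
# OS, GPU models, and relevant installed packages.
from torch.utils.collect_env import get_pretty_env_info

print(get_pretty_env_info())
```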
Additional context
Top GitHub Comments
I still couldn't repro it on a 16 GB card (I was getting honest OOMs), but I think what's happening is that for bs=23, `columns` (https://github.com/pytorch/vision/blob/7d52be76c8eaf02b12338afe0822396ab3547fe2/torchvision/csrc/ops/cuda/deform_conv2d_kernel.cu#L1079-L1080) has more than 2**31 elements (217055232, to be exact), and the im2col kernel uses int32 addressing, so some address computations overflow. The fix would be either to make the kernel templated and use int64 index computation when necessary, or, instead of limiting `n_parallel_imgs` to the constant `kMaxParallelImgs`, to compute `n_parallel_imgs` in such a way that `columns` has fewer than 2**31 elements.
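A minimal Python sketch of that second suggestion, assuming `columns` is allocated with shape (in_c * weight_h * weight_w, n_parallel_imgs * out_h * out_w) as in the linked kernel source. The names mirror the CUDA code, but the function itself is illustrative, not the actual patch:

```python
INT32_MAX = 2**31 - 1
K_MAX_PARALLEL_IMGS = 32  # kMaxParallelImgs in the CUDA kernel


def pick_n_parallel_imgs(batch_sz, in_c, weight_h, weight_w, out_h, out_w):
    """Largest divisor of batch_sz whose `columns` stays int32-addressable."""
    rows = in_c * weight_h * weight_w
    for n in range(min(batch_sz, K_MAX_PARALLEL_IMGS), 0, -1):
        if batch_sz % n == 0 and rows * n * out_h * out_w <= INT32_MAX:
            return n
    # Even a single image may overflow for extreme shapes; int64 indexing
    # (the first suggested fix) would be needed in that case.
    return 1


# For the failing case (bs=23 is prime, so the old code processed all 23
# images at once): pick_n_parallel_imgs(23, 64, 3, 3, 512, 512) -> 1
```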
Seems to be working, thanks a lot for the fix and sorry for my late reply.