Error running module on GPU
Hi Clement, I have been using your spatial-correlation-sampler. When I run python grad_check.py cpu it returns "ok", but with the backend option set to cuda I get the following error:
```
Traceback (most recent call last):
  File "grad_check.py", line 41, in <module>
    if gradcheck(correlation_sampler, [input1, input2]):
  File "/home/bowen/anaconda3/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 190, in gradcheck
    output = _differentiable_outputs(func(*inputs))
  File "/home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler/spatial_correlation_sampler.py", line 73, in forward
    dH, dW)
RuntimeError: sizes must be non-negative (THCTensor_resizeNd at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCTensor.cpp:108)
frame #0: THCudaDoubleTensor_newWithStorage + 0xfa (0x7fe7c042c9aa in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #1: at::CUDADoubleType::th_tensor(at::ArrayRef<long>) const + 0xa5 (0x7fe7c02cf4e5 in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: at::native::tensor(at::Type const&, at::ArrayRef<long>) + 0x3a (0x7fe7e34c37da in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #3: at::Type::tensor(at::ArrayRef<long>) const + 0x9 (0x7fe7e36b1b69 in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #4: torch::autograd::VariableType::tensor(at::ArrayRef<long>) const + 0x44 (0x7fe7e5333d04 in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: at::native::zeros(at::ArrayRef<long>, at::TensorOptions const&) + 0x31 (0x7fe7e351a651 in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: <unknown function> + 0x231f9 (0x7fe7bca6e1f9 in /home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler_backend.cpython-36m-x86_64-linux-gnu.so)
frame #7: correlation_cuda_forward(at::Tensor, at::Tensor, int, int, int, int, int, int, int, int, int, int) + 0x167 (0x7fe7bca6f204 in /home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler_backend.cpython-36m-x86_64-linux-gnu.so)
frame #8: correlation_sample_forward(at::Tensor, at::Tensor, int, int, int, int, int, int, int, int, int, int) + 0x1b4 (0x7fe7bca60ed4 in /home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler_backend.cpython-36m-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x217b9 (0x7fe7bca6c7b9 in /home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler_backend.cpython-36m-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x1fc2d (0x7fe7bca6ac2d in /home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler_backend.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #18: THPFunction_do_forward(THPFunction*, _object*) + 0x2ad (0x7fe7e5310fbd in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #35: __libc_start_main + 0xf5 (0x7fe7f9c8af45 in /lib/x86_64-linux-gnu/libc.so.6)
```
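For reference, this is roughly the kind of call grad_check.py ends up making (a minimal sketch; the sampler parameters and input shapes here are my own picks for illustration, not the exact values from your script):

```python
import torch
from torch.autograd import gradcheck
from spatial_correlation_sampler import SpatialCorrelationSampler

device = "cuda"  # "cpu" is the case that passes for me

# Sampler parameters chosen arbitrarily for this sketch
correlation_sampler = SpatialCorrelationSampler(
    kernel_size=1, patch_size=3, stride=1, padding=0, dilation_patch=1)

# gradcheck needs double-precision inputs with requires_grad=True
input1 = torch.randn(1, 2, 8, 8, dtype=torch.float64,
                     device=device, requires_grad=True)
input2 = torch.randn(1, 2, 8, 8, dtype=torch.float64,
                     device=device, requires_grad=True)

if gradcheck(correlation_sampler, [input1, input2]):
    print("ok")
```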
I also tried passing some tensors into the module from the console. Calling forward and backward seems to be fine as long as the computation is on the CPU; the module crashes once I put the same tensors on the GPU.
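My console test was something like this (a rough sketch from memory; the shapes and sampler parameters are arbitrary):

```python
import torch
from spatial_correlation_sampler import SpatialCorrelationSampler

sampler = SpatialCorrelationSampler(kernel_size=1, patch_size=3)

for device in ("cpu", "cuda"):
    x1 = torch.randn(1, 4, 16, 16, device=device, requires_grad=True)
    x2 = torch.randn(1, 4, 16, 16, device=device, requires_grad=True)
    out = sampler(x1, x2)   # forward: fine on CPU, crashes on GPU for me
    out.sum().backward()    # backward: also fine on CPU
    print(device, tuple(out.shape))
```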
The code is running on Ubuntu 14.04 with Anaconda Python 3.6.5, PyTorch 0.4.1, and CUDA 9.0.
It is worth mentioning that I had trouble installing the package: I later discovered that nvcc was using gcc-6.0, which is incompatible with CUDA 9.0, so I redirected the gcc symlink to gcc-4.9. After doing this I was able to compile and install. I am not sure whether this is related to the crash.
Please let me know what you think. Thank you!
Top GitHub Comments
Hi Clement, I am so sorry; please ignore what I said before about the error. I have tested my script further, and it is not about the module: SpatialCorrelationSampler only works with the CUDA device set to 0. I have not yet tried manually setting the visible GPU ids on the CUDA side to see what happens.
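For clarity, the workaround I have in mind is restricting the process to a single GPU before CUDA is initialized, so the chosen card shows up as device 0 inside PyTorch (an untested sketch on my side; the GPU index and shapes are just examples):

```python
import os
# Must be set before torch initializes CUDA; physical GPU 1 then appears as cuda:0
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
from spatial_correlation_sampler import SpatialCorrelationSampler

sampler = SpatialCorrelationSampler(kernel_size=1, patch_size=3)
x1 = torch.randn(1, 4, 16, 16, device="cuda", requires_grad=True)
x2 = torch.randn(1, 4, 16, 16, device="cuda", requires_grad=True)
print(tuple(sampler(x1, x2).shape))
```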
FINALLY fixed, thanks @InnovArul !