Error running module on GPU
Hi Clement, I have been using your spatial-correlation-sampler. When I run python grad_check.py cpu it returns "ok", but with the backend option set to cuda I get the following error:
```
Traceback (most recent call last):
  File "grad_check.py", line 41, in <module>
    if gradcheck(correlation_sampler, [input1, input2]):
  File "/home/bowen/anaconda3/lib/python3.6/site-packages/torch/autograd/gradcheck.py", line 190, in gradcheck
    output = _differentiable_outputs(func(*inputs))
  File "/home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler/spatial_correlation_sampler.py", line 73, in forward
    dH, dW)
RuntimeError: sizes must be non-negative (THCTensor_resizeNd at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCTensor.cpp:108)
frame #0: THCudaDoubleTensor_newWithStorage + 0xfa (0x7fe7c042c9aa in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #1: at::CUDADoubleType::th_tensor(at::ArrayRef<long>) const + 0xa5 (0x7fe7c02cf4e5 in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: at::native::tensor(at::Type const&, at::ArrayRef<long>) + 0x3a (0x7fe7e34c37da in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #3: at::Type::tensor(at::ArrayRef<long>) const + 0x9 (0x7fe7e36b1b69 in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #4: torch::autograd::VariableType::tensor(at::ArrayRef<long>) const + 0x44 (0x7fe7e5333d04 in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: at::native::zeros(at::ArrayRef<long>, at::TensorOptions const&) + 0x31 (0x7fe7e351a651 in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: <unknown function> + 0x231f9 (0x7fe7bca6e1f9 in /home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler_backend.cpython-36m-x86_64-linux-gnu.so)
frame #7: correlation_cuda_forward(at::Tensor, at::Tensor, int, int, int, int, int, int, int, int, int, int) + 0x167 (0x7fe7bca6f204 in /home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler_backend.cpython-36m-x86_64-linux-gnu.so)
frame #8: correlation_sample_forward(at::Tensor, at::Tensor, int, int, int, int, int, int, int, int, int, int) + 0x1b4 (0x7fe7bca60ed4 in /home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler_backend.cpython-36m-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x217b9 (0x7fe7bca6c7b9 in /home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler_backend.cpython-36m-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x1fc2d (0x7fe7bca6ac2d in /home/bowen/anaconda3/lib/python3.6/site-packages/spatial_correlation_sampler_backend.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #18: THPFunction_do_forward(THPFunction*, _object*) + 0x2ad (0x7fe7e5310fbd in /home/bowen/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #35: __libc_start_main + 0xf5 (0x7fe7f9c8af45 in /lib/x86_64-linux-gnu/libc.so.6)
```
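For reference, this is roughly the kind of call grad_check.py ends up making (a minimal sketch; the sampler parameters and input shapes here are my own picks for illustration, not the exact values from your script):

```python
import torch
from torch.autograd import gradcheck
from spatial_correlation_sampler import SpatialCorrelationSampler

device = "cuda"  # "cpu" is the case that passes for me

# Sampler parameters chosen arbitrarily for this sketch
correlation_sampler = SpatialCorrelationSampler(
    kernel_size=1, patch_size=3, stride=1, padding=0, dilation_patch=1)

# gradcheck needs double-precision inputs with requires_grad=True
input1 = torch.randn(1, 2, 8, 8, dtype=torch.float64,
                     device=device, requires_grad=True)
input2 = torch.randn(1, 2, 8, 8, dtype=torch.float64,
                     device=device, requires_grad=True)

if gradcheck(correlation_sampler, [input1, input2]):
    print("ok")
```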
I also tried passing some tensors into the module from the console. Calling forward and backward seems to be fine as long as the computation is on the CPU; the module crashes once I put the same tensors on the GPU.
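My console test was something like this (a rough sketch from memory; the shapes and sampler parameters are arbitrary):

```python
import torch
from spatial_correlation_sampler import SpatialCorrelationSampler

sampler = SpatialCorrelationSampler(kernel_size=1, patch_size=3)

for device in ("cpu", "cuda"):
    x1 = torch.randn(1, 4, 16, 16, device=device, requires_grad=True)
    x2 = torch.randn(1, 4, 16, 16, device=device, requires_grad=True)
    out = sampler(x1, x2)   # forward: fine on CPU, crashes on GPU for me
    out.sum().backward()    # backward: also fine on CPU
    print(device, tuple(out.shape))
```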
The code is running on Ubuntu 14.04 with Anaconda Python 3.6.5, PyTorch 0.4.1, and CUDA 9.0.
It is worth mentioning that I had trouble installing the package: I later discovered that nvcc was using gcc-6.0, which is incompatible with CUDA 9.0, so I redirected the gcc symlink to gcc-4.9. After doing this I was able to compile and install. I am not sure whether this is related to the crash.
Please let me know what you think. Thank you!
Top GitHub Comments
Hi Clement, I am so sorry; please ignore what I said before about the error. I have tested my script further, and it is not about the module: SpatialCorrelationSampler only works with the CUDA device set to 0. I have not yet tried manually setting the visible GPU ids on the CUDA side to see what happens.
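For clarity, the workaround I have in mind is restricting the process to a single GPU before CUDA is initialized, so the chosen card shows up as device 0 inside PyTorch (an untested sketch on my side; the GPU index and shapes are just examples):

```python
import os
# Must be set before torch initializes CUDA; physical GPU 1 then appears as cuda:0
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
from spatial_correlation_sampler import SpatialCorrelationSampler

sampler = SpatialCorrelationSampler(kernel_size=1, patch_size=3)
x1 = torch.randn(1, 4, 16, 16, device="cuda", requires_grad=True)
x2 = torch.randn(1, 4, 16, 16, device="cuda", requires_grad=True)
print(tuple(sampler(x1, x2).shape))
```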
FINALLY fixed, thanks @InnovArul !