[Bug] cublas RuntimeError in fantasy update test on CUDA
🐛 Bug
Running test.examples.test_simple_gp_regression.TestSimpleGPRegression.test_fantasy_updates
routinely results in the following cublas error: RuntimeError: cublas runtime error : an invalid numeric value was used as an argument
This only happens for the CUDA test; the CPU test runs fine. Anecdotally, it doesn't happen on every run or on every type of machine, but it happens pretty consistently.
To reproduce
Run the test on a cuda machine.
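Roughly, the failing test builds an exact GP on CUDA, conditions on a handful of new observations via `get_fantasy_model`, and then backprops through the posterior mean of the updated model. A minimal sketch of that flow follows; the model class, data, and shapes are illustrative placeholders, not copied from the test:

```python
import torch
import gpytorch


class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


device = torch.device("cuda")
train_x = torch.linspace(0, 1, 11, device=device)
train_y = torch.sin(6.0 * train_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood().to(device)
model = ExactGPModel(train_x, train_y, likelihood).to(device)

# get_fantasy_model relies on the prediction caches, so compute one posterior first.
model.eval()
likelihood.eval()
_ = model(torch.rand(3, device=device))

# Condition on new ("fantasy") observations without refitting.
new_x = torch.rand(5, device=device)
new_y = torch.sin(6.0 * new_x)
fantasy_model = model.get_fantasy_model(new_x, new_y)

# Backprop through the updated posterior mean -- this is the step where the
# cublas error surfaces on CUDA.
test_x = torch.rand(4, device=device)
preds = likelihood(fantasy_model(test_x))
preds.mean.sum().backward()
```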
**Stack trace/error message**
> test_fantasy_updates_cuda (test.examples.test_simple_gp_regression.TestSimpleGPRegression) ... ERROR
>
> ======================================================================
> ERROR: test_fantasy_updates_cuda (test.examples.test_simple_gp_regression.TestSimpleGPRegression)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
> File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/test/examples/test_simple_gp_regression.py", line 265, in test_fantasy_updates_cuda
> self.test_fantasy_updates(cuda=True)
> File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/test/examples/test_simple_gp_regression.py", line 308, in test_fantasy_updates
> test_function_predictions.mean.sum().backward()
> File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/torch/tensor.py", line 118, in backward
> torch.autograd.backward(self, gradient, retain_graph, create_graph)
> File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/torch/autograd/__init__.py", line 93, in backward
> allow_unreachable=True) # allow_unreachable flag
> RuntimeError: cublas runtime error : an invalid numeric value was used as an argument at caffe2/aten/src/THC/THCBlas.cu:120
>
>
> ActivityProfiler - start thread
> ** On entry to SGER parameter number 7 had an illegal value
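For reference, the reference-BLAS signature is `SGER(M, N, ALPHA, X, INCX, Y, INCY, A, LDA)`, so "parameter number 7" is `INCY`. The message therefore likely means a zero increment reached cuBLAS, which is what a zero-stride (expanded) tensor would look like to the legacy THC BLAS bindings. A tiny illustration of such a stride (hypothetical, not taken from the failing test):

```python
import torch

# SGER(M, N, ALPHA, X, INCX, Y, INCY, A, LDA): parameter 7 is INCY.
# An expanded tensor is a broadcasted view with stride 0, which legacy
# THC code could forward to cublasSger as incy=0 -- an illegal value.
y = torch.randn(1).expand(5)
print(y.stride())  # (0,)
```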
Expected Behavior
There shouldn't be a difference between the CPU and CUDA tests; the CUDA test should pass as well.
System information
gpytorch master, pytorch master, linux
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Wow, this seems pretty serious. I guess we’ve been lucky to mostly do batched MVMs as matrix-matrix multiplies in the code?
It's actually known and due to legacy code; they are moving things over to ATen now. Will comment on the other issue.
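For context on the first comment, a small illustration of doing a batch of matrix-vector multiplies as a single batched matrix-matrix multiply, which stays on the gemm path instead of the mv/ger path (illustrative only, not code from GPyTorch):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(8, 64, 64, device=device)
v = torch.randn(8, 64, device=device)

# Per-sample matrix-vector products dispatch to gemv-style kernels.
mv = torch.stack([A[i] @ v[i] for i in range(A.size(0))])

# Folding the vectors into a trailing matrix dimension dispatches to batched gemm.
mm = torch.matmul(A, v.unsqueeze(-1)).squeeze(-1)

assert torch.allclose(mv, mm, atol=1e-4)
```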