[Bug] cublas RuntimeError in fantasy update test on CUDA

🐛 Bug

Running test.examples.test_simple_gp_regression.TestSimpleGPRegression.test_fantasy_updates routinely results in the following cublas error: RuntimeError: cublas runtime error : an invalid numeric value was used as an argument

This only happens for the CUDA test; the CPU test runs fine. Anecdotally, I haven’t seen this happen on all runs or all types of machines, but it happens pretty consistently.

To reproduce

Run the test on a machine with a CUDA device.
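
A minimal sketch of running just this one test with `unittest`, assuming the script is launched from the root of a gpytorch checkout so that the `test` package resolves to the repository's tests. The helper names `cuda_available` and `run_fantasy_update_test` are hypothetical and not part of gpytorch; the test id is taken from the stack trace in this report.

```python
import importlib.util
import unittest

# Dotted test id as it appears in the stack trace of this report.
TEST_ID = (
    "test.examples.test_simple_gp_regression."
    "TestSimpleGPRegression.test_fantasy_updates_cuda"
)


def cuda_available() -> bool:
    """Return True only if torch is importable and reports a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


def run_fantasy_update_test(test_id: str = TEST_ID):
    """Run a single test by dotted name; return None when CUDA is missing."""
    if not cuda_available():
        print("CUDA not available; skipping", test_id)
        return None
    suite = unittest.defaultTestLoader.loadTestsFromName(test_id)
    return unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    run_fantasy_update_test()
```

Since the failure is reported as intermittent, it may take several runs before the error shows up.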

Stack trace/error message

> test_fantasy_updates_cuda (test.examples.test_simple_gp_regression.TestSimpleGPRegression) ... ERROR
>
> ======================================================================
> ERROR: test_fantasy_updates_cuda (test.examples.test_simple_gp_regression.TestSimpleGPRegression)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/test/examples/test_simple_gp_regression.py", line 265, in test_fantasy_updates_cuda
>     self.test_fantasy_updates(cuda=True)
>   File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/test/examples/test_simple_gp_regression.py", line 308, in test_fantasy_updates
>     test_function_predictions.mean.sum().backward()
>   File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/torch/tensor.py", line 118, in backward
>     torch.autograd.backward(self, gradient, retain_graph, create_graph)
>   File "/data/users/balandat/fbsource/fbcode/buck-out/dev/gen/pytorch/gpytorch/test_gpytorch_examples#binary,link-tree/torch/autograd/__init__.py", line 93, in backward
>     allow_unreachable=True)  # allow_unreachable flag
> RuntimeError: cublas runtime error : an invalid numeric value was used as an argument at caffe2/aten/src/THC/THCBlas.cu:120
>
>
> ActivityProfiler - start thread
>  ** On entry to SGER   parameter number 7 had an illegal value
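
One way to read the last line of the log is against the argument checks of the reference BLAS routine `SGER(M, N, ALPHA, X, INCX, Y, INCY, A, LDA)`. The sketch below mirrors those checks in plain Python. It assumes that cuBLAS numbers parameters the same way as reference BLAS, which this report does not confirm; under that assumption, parameter 7 would be `INCY`, the increment of the second vector, which must be nonzero. The function name `sger_arg_check` is hypothetical.

```python
def sger_arg_check(m: int, n: int, incx: int, incy: int, lda: int) -> int:
    """Mirror the argument checks of reference BLAS SGER.

    Signature being checked: SGER(M, N, ALPHA, X, INCX, Y, INCY, A, LDA).
    Returns 0 when the arguments are valid, otherwise the 1-based position
    of the first illegal argument, as reference BLAS reports it.
    """
    if m < 0:
        return 1  # M must be non-negative
    if n < 0:
        return 2  # N must be non-negative
    if incx == 0:
        return 5  # INCX must be nonzero
    if incy == 0:
        return 7  # INCY must be nonzero
    if lda < max(1, m):
        return 9  # LDA must be at least max(1, M)
    return 0


# A zero increment for the second vector is flagged as parameter 7.
print(sger_arg_check(m=3, n=4, incx=1, incy=0, lda=3))
```

If the assumption holds, checking the strides of the vectors that reach the outer-product call during the backward pass would be a reasonable next step; this is a guess at a direction, not a diagnosis.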

Expected Behavior

There shouldn’t be a difference between cpu and cuda tests.

System information

gpytorch master, pytorch master, linux

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

Balandat commented, May 12, 2020

Wow, this seems pretty serious. I guess we’ve been lucky to mostly do batched MVMs as matrix-matrix multiplies in the code?

Balandat commented, May 12, 2020

It’s actually known and due to legacy reasons; they are moving things over to ATen now. Will comment on the other issue.

Top Results From Across the Web

Cublas Bug - GPU-Accelerated Libraries
Hi, I think I have met a bug about cublas. ... and it raise an error: RuntimeError: CUDA error: an illegal memory access...

CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when ...
I set os.environ['CUDA_LAUNCH_BLOCKING'] = "1" command after I got RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling ...

Runtime error when translating using ctranslate2 - Support
I finally managed to reproduce the issue by testing with an older GPU driver. The version of the cuBLAS library included in the...

Cublas runtime error on GPU running, but works on CPU
This error can appear when tensor dimension mismatch. But the whole thing works on CPU, so it won't have unmatched tensor size. This...

runtimeerror: cuda error: cublas_status_not_initialized when ...
The most likely reason is that there is an inconsistency between number of labels and number of output units. Try printing the size...