`Blas xGEMV launch failed` but it worked on the same GPU last week

Environment:
  • TensorFlow: 2.8.2+zzzcolab20220929150707
  • Driver version: 460.32.03
  • CUDA version: 11.2
  • GPU: Tesla T4
  • Tier: Colab Pro (I have enough compute units)

/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
     65     except Exception as e:  # pylint: disable=broad-except
     66       filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67       raise e.with_traceback(filtered_tb) from None
     68     finally:
     69       del filtered_tb

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
   7162 def raise_from_not_ok_status(e, name):
   7163   e.message += (" name: " + name if name is not None else "")
-> 7164   raise core._status_to_exception(e) from None  # pylint: disable=protected-access
   7165 
   7166 

InternalError: Exception encountered when calling layer "dense_1" (type Dense).

Blas xGEMV launch failed : a.shape=[1,4704000,8], b.shape=[1,8,1], m=4704000, n=1, k=8 [Op:MatMul]

Call arguments received by layer "dense_1" (type Dense):
  • inputs=tf.Tensor(shape=(196, 24000, 8), dtype=float32)

I found that this error usually indicates running out of memory. However, this NN trained correctly on a Tesla T4 last week (Oct. 6th).
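The shapes in the error message are enough to sanity-check the out-of-memory theory. The sketch below is plain Python arithmetic, not TensorFlow code. It assumes float32 elements (4 bytes, matching `dtype=float32` in the traceback) and reads `units == 1` off `b.shape=[1,8,1]`. It shows that m=4704000 is just 196 × 24000, i.e. the first two input axes collapsed into rows, and that n=1 makes this a matrix times a single column, which is consistent with the `xGEMV` named in the error. The whole input tensor is only about 144 MiB, small next to the 16 GB on a Tesla T4, so a plain OOM looks like an unlikely explanation for this particular MatMul.

```python
# Plain-Python arithmetic on the shapes reported in the traceback.
# Assumptions: float32 elements (4 bytes) and units == 1 (from b.shape=[1,8,1]).
BATCH, STEPS, FEATURES, UNITS = 196, 24000, 8, 1
FLOAT32_BYTES = 4


def dense_matmul_dims(batch, steps, features, units):
    """Return (m, k, n) for a Dense layer applied to a rank-3 input.

    The two leading axes collapse into m rows, so the kernel multiply is an
    (m x k) @ (k x n) product. With units == 1, n == 1: a matrix-vector product.
    """
    return batch * steps, features, units


m, k, n = dense_matmul_dims(BATCH, STEPS, FEATURES, UNITS)
input_mib = m * k * FLOAT32_BYTES / 2**20   # size of the layer's input tensor
output_mib = m * n * FLOAT32_BYTES / 2**20  # size of the layer's output tensor

print(f"m={m}, k={k}, n={n}")
print(f"input ~{input_mib:.1f} MiB, output ~{output_mib:.1f} MiB")
```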

Now, training with the same code and the same datasets fails. I also tried TF 2.7.4 and 2.9.2, with the same error.

I also compared the driver and CUDA versions against the `!nvidia-smi` output from another project that I ran last week and that worked correctly. Both versions are identical.

Additionally, another project that uses the same NN and very similar datasets also fails now.

Has Colab updated something? Should I upgrade or downgrade something, such as the CUDA version?

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

metrizable commented on Oct 18, 2022 (1 reaction)

Thanks for sharing the notebook. I was able to reproduce the error from the OP: Blas xGEMV launch failed...

I have determined that the error raised in your notebook is due to the version of cuBLAS available in Colab. A similar issue was captured in https://github.com/tensorflow/tensorflow/issues/54463 and is noted in the cuBLAS release notes as fixed in version 11.4: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-11.4.0

A workaround that worked for me, and that you may want to try, is to install libcublas-11-4 in the first cell:

!apt-get install libcublas-11-4

Then run the rest of your notebook as usual.

This may provide a way forward until an upgraded libcublas can be brought into Colab.

arvindrajan92 commented on Oct 25, 2022 (0 reactions)

Hi @wp45rw,

As I mentioned here, would you mind trying a downgrade from CUDA 11.2 to CUDA 11.1?

If you are using a conda environment and follow the installation steps in https://www.tensorflow.org/install/pip, running `conda install -c conda-forge cudatoolkit=11.1 cudatoolkit-dev=11.1 cudnn=8.1.0 -y` sets up the environment correctly for me. I have been doing this to work around the issue with CUDA 11.2, and you might find it helpful as well.
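The `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/` step in the commands that follow is what points the dynamic loader at the CUDA libraries installed into the conda environment. A small, hypothetical helper for checking that from Python is sketched here; the function names and the example prefix `/opt/conda/envs/cuda111` are illustrative only, not part of conda or TensorFlow. Because the export appends the conda directory, any directories already listed earlier in `LD_LIBRARY_PATH` are searched before it, so the returned index is worth a glance if an unexpected CUDA version still gets picked up.

```python
import os


def ld_library_dirs(value):
    """Split an LD_LIBRARY_PATH string into normalized directory entries.

    Empty entries (e.g. the leading ':' produced when LD_LIBRARY_PATH was
    unset before the export) are kept as '' rather than normalized to '.'.
    """
    return [os.path.normpath(p) if p else "" for p in value.split(":")]


def conda_lib_position(env=None):
    """Return the index of $CONDA_PREFIX/lib in LD_LIBRARY_PATH, or -1 if absent.

    A lower index means that directory appears earlier in the search order.
    """
    env = os.environ if env is None else env
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        return -1  # no conda environment is active
    target = os.path.normpath(os.path.join(prefix, "lib"))
    dirs = ld_library_dirs(env.get("LD_LIBRARY_PATH", ""))
    return dirs.index(target) if target in dirs else -1


if __name__ == "__main__":
    # Hypothetical environment mirroring the export line in the test steps.
    example_env = {
        "CONDA_PREFIX": "/opt/conda/envs/cuda111",
        "LD_LIBRARY_PATH": ":/opt/conda/envs/cuda111/lib/",
    }
    print(conda_lib_position(example_env))
    print(conda_lib_position())  # checks the current process environment
```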

Test With CUDA 11.1 (No Error)

conda create -n cuda111 python=3.9.12 -y
conda activate cuda111
conda install -c conda-forge cudatoolkit=11.1 cudatoolkit-dev=11.1 cudnn=8.1.0 -y
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install tensorflow==2.9.1
python -c "import tensorflow as tf; empty_image = tf.zeros(shape=[1280, 1280, 3], dtype=tf.float32); gray_image = tf.image.rgb_to_grayscale(empty_image); print(tf.shape(gray_image))"

Test With CUDA 11.2 (Throws INTERNAL: Blas xGEMV launch failed Error)

conda create -n cuda112 python=3.9.12 -y
conda activate cuda112
conda install -c conda-forge cudatoolkit=11.2 cudatoolkit-dev=11.2 cudnn=8.1.0 -y
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install tensorflow==2.9.1
python -c "import tensorflow as tf; empty_image = tf.zeros(shape=[1280, 1280, 3], dtype=tf.float32); gray_image = tf.image.rgb_to_grayscale(empty_image); print(tf.shape(gray_image))"
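The one-line repro above uses `tf.image.rgb_to_grayscale`. Assuming it works the way TensorFlow's implementation does to our knowledge, contracting the channel axis against a 3-element weight vector, it reduces to the same kind of operation as the failing Dense layer: a tall matrix times a single column. The stdlib sketch below spells that out. `RGB_WEIGHTS` are the coefficients we believe TensorFlow uses, and `rgb_to_gray_gemv` is a hypothetical helper, not a TensorFlow API. For the 1280 × 1280 × 3 test image the contraction is (1638400 × 3) @ (3 × 1), with n = 1 just as in the original `m=4704000, n=1, k=8` failure.

```python
# Coefficients assumed to match tf.image.rgb_to_grayscale's RGB weights.
RGB_WEIGHTS = (0.2989, 0.5870, 0.1140)


def rgb_to_gray_gemv(pixels, weights=RGB_WEIGHTS):
    """Grayscale conversion written as a matrix-vector product.

    `pixels` is the image flattened to m rows of [r, g, b] (an m x 3 matrix);
    each output value is the dot product of one row with `weights` (3 x 1).
    """
    return [sum(c * w for c, w in zip(row, weights)) for row in pixels]


# Shape of the contraction in the repro: a 1280x1280 RGB image.
height, width, channels = 1280, 1280, 3
m, k, n = height * width, channels, 1
print(f"({m} x {k}) @ ({k} x {n})")

# Tiny 2x2 image (black, white, red, blue) to show the arithmetic.
tiny = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(rgb_to_gray_gemv(tiny))
```

Since the two conda environments differ only in the CUDA toolkit version, a GEMV-shaped call failing only under 11.2 is consistent with the maintainer's diagnosis that the problem lies in the cuBLAS shipped with CUDA 11.2 rather than in the model code.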