Failed to find best cuBLAS algorithm, GEMM performance might be suboptimal
Dear Yoshitaka,
I was able to successfully install ColabFold on a machine running Ubuntu 20 with a Tesla K40c GPU. A prediction runs to completion, although it takes ~24 h for a 430-residue protein, with the command
colabfold_batch --amber --templates --num-recycle 3 test.fasta /home/pc08/Desktop/test_AF
During calculation, I get the following messages:
E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 23994040320 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:211] Failed to find best cuBLAS algorithm, GEMM performance might be suboptimal: INTERNAL: All algorithms tried for %cublas-gemm.21 = f32[1908,128]{1,0} custom-call(f32[1908,128]{1,0} %bitcast.319, f32[128,128]{1,0} %bitcast.321), custom_call_target="__cublas$gemm", backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"lhs_stride\":\"244224\",\"rhs_stride\":\"16384\"}" failed. Falling back to default algorithm.
Let me repeat that, nevertheless, the calculations for the 5 models are completed and look good. But I would like to know whether this can be made more efficient. Not sure if it’s useful, but I’ve noticed that the GPU is only sporadically used (with the GPU’s memory at maximum the whole time), while one CPU core runs at 100% during most of the run. If you have any insights on how to resolve this, I would greatly appreciate it. Thanks!
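For reference, the utilization pattern described above can be checked while a job runs using standard NVIDIA tooling (the 2-second polling interval and the log file name `gpu_usage.csv` below are arbitrary choices, not from the original report):

```shell
# Interactive view of GPU utilization/memory, refreshed every 2 seconds:
watch -n 2 nvidia-smi

# Or log utilization and memory to CSV, suitable for attaching to an issue:
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
           --format=csv -l 2 > gpu_usage.csv
```

A GPU that sits near 0% utilization while one CPU core is pinned at 100% usually indicates the run is stalled on a host-side step rather than the GPU itself.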
Issue Analytics
- Created 2 years ago
- Comments: 6 (2 by maintainers)
Top GitHub Comments
Hello, I have an update on this. Today I ran the same test, but this time WITHOUT the environment variables indicated in the linked issues:
Now the calculations were completed, with the GPU used at capacity, taking one hour for a ~430-residue model.
Still, the
Failed to find best cuBLAS algorithm, GEMM performance might be suboptimal
warnings are still showing up, and GPU memory is also at ~97%, but at least the process completes and runs much faster than before. Thanks, Fernando
Hi Yoshitaka, I have tried again, installing cuDNN for my CUDA 11.4 and running the environment-variable settings suggested in the issues you pointed to:
export XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1
export TF_FORCE_UNIFIED_MEMORY="1"
export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"
export XLA_PYTHON_CLIENT_ALLOCATOR="platform"
export TF_FORCE_GPU_ALLOW_GROWTH="true"
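For convenience, the exports above can be bundled into a small wrapper script, so the flags are always set before launching a job. This is only a sketch: the script name is hypothetical, the flag values simply mirror the exports above, and the input/output paths are the ones from the original command.

```shell
#!/bin/sh
# Sketch: set the memory-related XLA/TF flags, then launch ColabFold.
# Values mirror the exports discussed in this thread; adjust per GPU.
export XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1
export TF_FORCE_UNIFIED_MEMORY="1"              # allow spilling to host RAM
export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"     # oversubscribe via unified memory
export XLA_PYTHON_CLIENT_ALLOCATOR="platform"
export TF_FORCE_GPU_ALLOW_GROWTH="true"

# Same invocation as in the original report; substitute your own paths.
colabfold_batch --amber --templates --num-recycle 3 test.fasta /home/pc08/Desktop/test_AF
```

Keeping the flags in one script avoids a run silently using different settings because an export was forgotten in a new terminal session.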
Even though the same warnings were showing in the terminal, I was surprised to see that now the GPU was being used almost at capacity (GPU’s memory was also >97% used). However, after a while, the process stopped with the following message:
The complete terminal output is in the attached text file. Any suggestions on how to proceed?
Thanks, Fernando
issue_colabfold.txt