RuntimeError: cuSolver internal error
Did anyone solve the cuSolver internal error?
I0716 21:59:04.723145 139668278384448 run_docker.py:180] WARNING:tensorflow:From /app/alphafold/alphafold/model/tf/input_pipeline.py:151: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
I0716 21:59:04.723469 139668278384448 run_docker.py:180] Instructions for updating:
I0716 21:59:04.723672 139668278384448 run_docker.py:180] Use fn_output_signature instead
I0716 21:59:04.723857 139668278384448 run_docker.py:180] W0716 13:59:04.722220 140425547745088 deprecation.py:528] From /app/alphafold/alphafold/model/tf/input_pipeline.py:151: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
I0716 21:59:04.724043 139668278384448 run_docker.py:180] Instructions for updating:
I0716 21:59:04.724218 139668278384448 run_docker.py:180] Use fn_output_signature instead
I0716 21:59:08.106853 139668278384448 run_docker.py:180] 2021-07-16 13:59:08.105871: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I0716 21:59:08.107234 139668278384448 run_docker.py:180] 2021-07-16 13:59:08.105914: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
I0716 21:59:08.107458 139668278384448 run_docker.py:180] Skipping registering GPU devices...
I0716 21:59:09.326027 139668278384448 run_docker.py:180] I0716 13:59:09.324977 140425547745088 model.py:132] Running predict with shape(feat) = {'aatype': (4, 44), 'residue_index': (4, 44), 'seq_length': (4,), 'template_aatype': (4, 4, 44), 'template_all_atom_masks': (4, 4, 44, 37), 'template_all_atom_positions': (4, 4, 44, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 44), 'msa_mask': (4, 508, 44), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 44, 3), 'template_pseudo_beta_mask': (4, 4, 44), 'atom14_atom_exists': (4, 44, 14), 'residx_atom14_to_atom37': (4, 44, 14), 'residx_atom37_to_atom14': (4, 44, 37), 'atom37_atom_exists': (4, 44, 37), 'extra_msa': (4, 5120, 44), 'extra_msa_mask': (4, 5120, 44), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 508, 44), 'true_msa': (4, 508, 44), 'extra_has_deletion': (4, 5120, 44), 'extra_deletion_value': (4, 5120, 44), 'msa_feat': (4, 508, 44, 49), 'target_feat': (4, 44, 22)}
I0716 21:59:58.832500 139668278384448 run_docker.py:180] 2021-07-16 13:59:58.831660: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I0716 21:59:58.929851 139668278384448 run_docker.py:180] Traceback (most recent call last):
I0716 21:59:58.930104 139668278384448 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 283, in <module>
I0716 21:59:58.930308 139668278384448 run_docker.py:180] app.run(main)
I0716 21:59:58.930593 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
I0716 21:59:58.930781 139668278384448 run_docker.py:180] _run_main(main, args)
I0716 21:59:58.930959 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
I0716 21:59:58.931135 139668278384448 run_docker.py:180] sys.exit(main(argv))
I0716 21:59:58.931310 139668278384448 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 255, in main
I0716 21:59:58.931483 139668278384448 run_docker.py:180] predict_structure(
I0716 21:59:58.931658 139668278384448 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 137, in predict_structure
I0716 21:59:58.931832 139668278384448 run_docker.py:180] prediction_result = model_runner.predict(processed_feature_dict)
I0716 21:59:58.932008 139668278384448 run_docker.py:180] File "/app/alphafold/alphafold/model/model.py", line 134, in predict
I0716 21:59:58.932183 139668278384448 run_docker.py:180] result = self.apply(self.params, jax.random.PRNGKey(0), feat)
I0716 21:59:58.932358 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
I0716 21:59:58.932534 139668278384448 run_docker.py:180] return fun(*args, **kwargs)
I0716 21:59:58.932709 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/_src/api.py", line 424, in cache_miss
I0716 21:59:58.932884 139668278384448 run_docker.py:180] out_flat = xla.xla_call(
I0716 21:59:58.933057 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/core.py", line 1560, in bind
I0716 21:59:58.933231 139668278384448 run_docker.py:180] return call_bind(self, fun, *args, **params)
I0716 21:59:58.933458 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/core.py", line 1551, in call_bind
I0716 21:59:58.933639 139668278384448 run_docker.py:180] outs = primitive.process(top_trace, fun, tracers, params)
I0716 21:59:58.933813 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/core.py", line 1563, in process
I0716 21:59:58.933987 139668278384448 run_docker.py:180] return trace.process_call(self, fun, tracers, params)
I0716 21:59:58.934161 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/core.py", line 606, in process_call
I0716 21:59:58.934336 139668278384448 run_docker.py:180] return primitive.impl(f, *tracers, **params)
I0716 21:59:58.934510 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/interpreters/xla.py", line 592, in _xla_call_impl
I0716 21:59:58.934684 139668278384448 run_docker.py:180] compiled_fun = _xla_callable(fun, device, backend, name, donated_invars,
I0716 21:59:58.934857 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/linear_util.py", line 262, in memoized_fun
I0716 21:59:58.935029 139668278384448 run_docker.py:180] ans = call(fun, *args)
I0716 21:59:58.935202 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/interpreters/xla.py", line 723, in _xla_callable
I0716 21:59:58.935374 139668278384448 run_docker.py:180] out_nodes = jaxpr_subcomp(
I0716 21:59:58.935548 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/interpreters/xla.py", line 462, in jaxpr_subcomp
I0716 21:59:58.935724 139668278384448 run_docker.py:180] ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),
I0716 21:59:58.935896 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/_src/lax/control_flow.py", line 350, in _while_loop_translation_rule
I0716 21:59:58.936069 139668278384448 run_docker.py:180] new_z = xla.jaxpr_subcomp(body_c, body_jaxpr.jaxpr, backend, axis_env,
I0716 21:59:58.936244 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/interpreters/xla.py", line 462, in jaxpr_subcomp
I0716 21:59:58.936418 139668278384448 run_docker.py:180] ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),
I0716 21:59:58.936592 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/interpreters/xla.py", line 1040, in f
I0716 21:59:58.936766 139668278384448 run_docker.py:180] outs = jaxpr_subcomp(c, jaxpr, backend, axis_env, _xla_consts(c, consts),
I0716 21:59:58.936941 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/interpreters/xla.py", line 462, in jaxpr_subcomp
I0716 21:59:58.937116 139668278384448 run_docker.py:180] ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),
I0716 21:59:58.937289 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/_src/lax/control_flow.py", line 350, in _while_loop_translation_rule
I0716 21:59:58.937474 139668278384448 run_docker.py:180] new_z = xla.jaxpr_subcomp(body_c, body_jaxpr.jaxpr, backend, axis_env,
I0716 21:59:58.937648 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/interpreters/xla.py", line 453, in jaxpr_subcomp
I0716 21:59:58.937821 139668278384448 run_docker.py:180] ans = rule(c, *in_nodes, **eqn.params)
I0716 21:59:58.937993 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jax/_src/lax/linalg.py", line 503, in _eigh_cpu_gpu_translation_rule
I0716 21:59:58.938167 139668278384448 run_docker.py:180] v, w, info = syevd_impl(c, operand, lower=lower)
I0716 21:59:58.938340 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jaxlib/cusolver.py", line 281, in syevd
I0716 21:59:58.938514 139668278384448 run_docker.py:180] lwork, opaque = cusolver_kernels.build_syevj_descriptor(
I0716 21:59:58.938688 139668278384448 run_docker.py:180] jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: cuSolver internal error
I0716 21:59:58.938864 139668278384448 run_docker.py:180]
I0716 21:59:58.939038 139668278384448 run_docker.py:180] The stack trace below excludes JAX-internal frames.
I0716 21:59:58.939213 139668278384448 run_docker.py:180] The preceding is the original exception that occurred, unmodified.
I0716 21:59:58.939386 139668278384448 run_docker.py:180] --------------------
I0716 21:59:58.939731 139668278384448 run_docker.py:180]
I0716 21:59:58.939903 139668278384448 run_docker.py:180] The above exception was the direct cause of the following exception:
I0716 21:59:58.940074 139668278384448 run_docker.py:180]
I0716 21:59:58.940248 139668278384448 run_docker.py:180] Traceback (most recent call last):
I0716 21:59:58.940423 139668278384448 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 283, in <module>
I0716 21:59:58.940596 139668278384448 run_docker.py:180] app.run(main)
I0716 21:59:58.940770 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
I0716 21:59:58.940943 139668278384448 run_docker.py:180] _run_main(main, args)
I0716 21:59:58.941116 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
I0716 21:59:58.941291 139668278384448 run_docker.py:180] sys.exit(main(argv))
I0716 21:59:58.941488 139668278384448 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 255, in main
I0716 21:59:58.941663 139668278384448 run_docker.py:180] predict_structure(
I0716 21:59:58.941836 139668278384448 run_docker.py:180] File "/app/alphafold/run_alphafold.py", line 137, in predict_structure
I0716 21:59:58.942028 139668278384448 run_docker.py:180] prediction_result = model_runner.predict(processed_feature_dict)
I0716 21:59:58.942201 139668278384448 run_docker.py:180] File "/app/alphafold/alphafold/model/model.py", line 134, in predict
I0716 21:59:58.942372 139668278384448 run_docker.py:180] result = self.apply(self.params, jax.random.PRNGKey(0), feat)
I0716 21:59:58.942544 139668278384448 run_docker.py:180] File "/opt/conda/lib/python3.8/site-packages/jaxlib/cusolver.py", line 281, in syevd
I0716 21:59:58.942715 139668278384448 run_docker.py:180] lwork, opaque = cusolver_kernels.build_syevj_descriptor(
I0716 21:59:58.942886 139668278384448 run_docker.py:180] RuntimeError: cuSolver internal error
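
The root cause is visible earlier in the log: the container failed to load 'libcudnn.so.8' and 'libcusolver.so.10', which points to a mismatch between the CUDA libraries in the image and the jaxlib wheel installed in it. A quick way to confirm which libraries the dynamic loader can actually see (a sketch, assuming the image is tagged alphafold as in the setup docs) is:

docker run --rm -it --gpus all alphafold bash
# inside the container: list the cuDNN/cuSolver libraries known to the loader
ldconfig -p | grep -E 'libcudnn|libcusolver'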
I also solved it by modifying the Dockerfile, and the structure prediction ran to completion. I obtained an amazingly accurate predicted structure! The solution suggested by @kuixu is much simpler, so I will also try it!
I solved it by modifying the installation of jax and jaxlib in the Dockerfile:
pip3 install --upgrade "jax[cuda111]" -f https://storage.googleapis.com/jax-releases/jax_releases.html
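
After rebuilding the image with the cuda111 wheels, a minimal check that jaxlib can actually initialize the GPU (run inside the rebuilt container; the expected output is an assumption based on jax of that era) is:

python3 -c 'import jax; print(jax.devices())'
# expect something like [GpuDevice(id=0)]; a CPU-only device list means jaxlib still cannot load the CUDA libraries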
Download jaxlib locally and copy it into the Docker image.
Steps:
from: https://github.com/deepmind/alphafold/blob/1109480e6f38d71b3b265a4a25039e51e2343368/docker/Dockerfile#L64-L67
to:
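
The replacement block itself was not captured above; as a hedged sketch (the wheel version and filename are assumptions — pick the jaxlib build that matches the CUDA toolkit in the base image), the edited Dockerfile lines might look like:

# Option A: install the CUDA 11.1 build directly from the jax releases index
RUN pip3 install --upgrade "jax[cuda111]" -f https://storage.googleapis.com/jax-releases/jax_releases.html

# Option B: copy a locally downloaded jaxlib wheel into the image and install it
# (the wheel filename below is a placeholder)
COPY jaxlib-0.1.69+cuda111-cp38-none-manylinux2010_x86_64.whl /tmp/
RUN pip3 install /tmp/jaxlib-0.1.69+cuda111-cp38-none-manylinux2010_x86_64.whl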