"DNN library is not found." error when tensorflow is loaded before JAX

Running the following script:

import jax.numpy as jnp
import tensorflow_datasets as tfds
from flax import linen as nn
from jax import random

# See https://github.com/tensorflow/tensorflow/issues/53831.
train_ds = tfds.load("cifar10", split="train", as_supervised=True)

model = nn.Conv(features=1, kernel_size=(3, 3), strides=(1, 1))
params = model.init(random.PRNGKey(123), jnp.zeros((1, 32, 32, 3)))

gives me an error:

RuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv = (f32[1,32,32,1]{2,1,3,0}, u8[0]{0}) custom-call(f32[1,32,32,3]{2,1,3,0} %copy.3, f32[3,3,3,1]{1,0,2,3} %copy.4), window={size=3x3 pad=1_1x1_1}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convForward", metadata={op_type="conv_general_dilated" op_name="jit(conv_general_dilated)/conv_general_dilated[\n  batch_group_count=1\n  dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 3, 1, 2), rhs_spec=(3, 2, 0, 1), out_spec=(0, 3, 1, 2))\n  feature_group_count=1\n  lhs_dilation=(1, 1)\n  lhs_shape=(1, 32, 32, 3)\n  padding=((1, 1), (1, 1))\n  precision=None\n  preferred_element_type=None\n  rhs_dilation=(1, 1)\n  rhs_shape=(3, 3, 3, 1)\n  window_strides=(1, 1)\n]" source_file="/nix/store/ys9bmmwpdqf3vlgxjvfy770qdk4dcf1n-python3.9-flax-0.3.6/lib/python3.9/site-packages/flax/linen/linear.py" source_line=282}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}"

Original error: UNIMPLEMENTED: DNN library is not found.

But if I force TF to run on CPU with

import tensorflow as tf

tf.config.set_visible_devices([], 'GPU')

import jax.numpy as jnp
import tensorflow_datasets as tfds
from flax import linen as nn
from jax import random

# See https://github.com/tensorflow/tensorflow/issues/53831.
train_ds = tfds.load("cifar10", split="train", as_supervised=True)

model = nn.Conv(features=1, kernel_size=(3, 3), strides=(1, 1))
params = model.init(random.PRNGKey(123), jnp.zeros((1, 32, 32, 3)))

Then it works!

Why does TF having access to the GPU affect JAX’s ability to locate cuDNN?

Here’s my shell.nix for complete reproducibility: https://gist.github.com/samuela/319059b88a46a994b4c10dfa718f379e

And here’s a relevant comment on another issue: https://github.com/NixOS/nixpkgs/pull/158186#issuecomment-1030486912

Top GitHub Comments

mattjj commented, Mar 16, 2022

I’m going to close this issue because there are already a few open that are about making this error message better.

samuela commented, Feb 5, 2022

Ah, I see. I still find the error message confusing since cuDNN is found, just does not succeed in initializing. But I think I can get things working from here.
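
The thread does not state why cuDNN fails to initialize. One common cause when TensorFlow and JAX share a GPU is memory contention: by default TensorFlow reserves most of the GPU memory, which can leave too little for JAX. Assuming that is what happens here, a possible alternative to hiding the GPU from TensorFlow is to ask both libraries to allocate memory on demand. The sketch below is untested against this setup; the helper name configure_shared_gpu_env is hypothetical, while TF_FORCE_GPU_ALLOW_GROWTH and XLA_PYTHON_CLIENT_PREALLOCATE are the environment variables the two libraries read at startup.

```python
import os


def configure_shared_gpu_env(env=None):
    """Hypothetical helper: request on-demand GPU memory allocation.

    Must run before tensorflow or jax is imported, because both
    libraries read these variables when they initialize the GPU.
    """
    env = os.environ if env is None else env
    # TensorFlow: grow GPU memory usage as needed instead of reserving
    # nearly all of it up front.
    env.setdefault("TF_FORCE_GPU_ALLOW_GROWTH", "true")
    # JAX/XLA: disable the default up-front preallocation of GPU memory.
    env.setdefault("XLA_PYTHON_CLIENT_PREALLOCATE", "false")
    return env


configure_shared_gpu_env()

# Import tensorflow, tensorflow_datasets, jax, and flax only after this point.
```

setdefault leaves any value already set in the shell untouched, so this does not override an explicit configuration. If TensorFlow only feeds the input pipeline, as in the script above, hiding the GPU with tf.config.set_visible_devices([], 'GPU') remains the simpler option.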