`jaxlib==0.1.44` segfaults when trying to run XLA on GPU

When I run JAX on GPU with jaxlib==0.1.44, I get a segmentation fault on my machine (Python 3.8, CUDA 10.2). The issue no longer occurs if I downgrade jaxlib to 0.1.43.

I installed both versions of jaxlib using the installation instructions in the README, and in both cases I set the XLA CUDA directory to the same location. As far as I can tell, the jaxlib version is the only thing that differs between the working setup and the segfaulting one.

I did some digging, and the segfault seems to come from jaxlib/xla_extension.so. Here is what gdb reports:

0x00007fffd6f991e8 in absl::lts_2020_02_25::Mutex::ReaderLock() () from /home/ziyadedher/research/.venv/lib/python3.8/site-packages/jaxlib/xla_extension.so

Reverting to jaxlib==0.1.43 fixes the issue.

>>> jax.__version__
'0.1.63'
>>> jaxlib.__version__
'0.1.44'
>>> tensorflow.__version__
'2.2.0-rc3'

Some system information truncated to show the important bits:

$ nvcc --version
Cuda compilation tools, release 10.2, V10.2.89
$ python --version
Python 3.8.2
$ modinfo nvidia
filename:       /lib/modules/5.6.4-arch1-1/extramodules/nvidia.ko.xz
version:        440.82

Issue Analytics

Created 3 years ago
Comments: 15 (6 by maintainers)

Top GitHub Comments

6 reactions

hawkinsp commented, Apr 21, 2020

We have a strong suspicion that the bug is here: https://github.com/tensorflow/tensorflow/blob/05991352f7fdb12ed774561269609fd908e7f95e/tensorflow/compiler/xla/python/local_client.cc#L778

.release() and .get() are called on a std::unique_ptr in different arguments to the same function. The order in which function arguments are evaluated differs between compilers (e.g., clang vs. gcc). We tend to test with clang internally (and have never seen this bug), but our external builds are built with gcc, which has the opposite order of evaluation. @skye is preparing a fix.
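
For readers unfamiliar with this class of bug, here is a minimal sketch of the pattern described in the comment above. It is not the actual XLA code: `Buffer`, `Handle`, `MakeHandle` and `MakeHandleSafely` are hypothetical names invented for illustration. C++ leaves the evaluation order of function arguments unspecified, so if `release()` happens to run before `get()`, the `get()` call returns a null pointer. One common way to avoid the hazard (not necessarily what the linked fix does) is to read everything needed from the pointer into locals before giving up ownership.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an object that carries some metadata.
struct Buffer {
  int device_id;
};

// Hypothetical result type: owns the buffer and caches one of its fields.
struct Handle {
  std::unique_ptr<Buffer> owned;
  int device_id;
};

// Hypothetical callee that takes ownership of `raw` plus a value derived
// from the same object.
Handle MakeHandle(Buffer* raw, int device_id) {
  return Handle{std::unique_ptr<Buffer>(raw), device_id};
}

// Hazardous pattern (shown only as a comment, never executed here):
//
//   MakeHandle(buf.release(), buf.get()->device_id);
//
// The two arguments may be evaluated in either order. If release() runs
// first, buf.get() returns nullptr and the dereference is undefined
// behavior, which typically shows up as a crash.

// Safer variant: read what is needed first, then release ownership.
Handle MakeHandleSafely(std::unique_ptr<Buffer> buf) {
  assert(buf != nullptr);
  const int device_id = buf->device_id;  // sequenced before release()
  return MakeHandle(buf.release(), device_id);
}
```

Calling `MakeHandleSafely(std::make_unique<Buffer>(Buffer{3}))` behaves the same regardless of which argument order the compiler picks, because the read and the release are now separate statements.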

5 reactions

skye commented, Apr 21, 2020

This should be fixed in jaxlib 0.1.45, hot off the press! I’m gonna close this, but please let us know if you’re still experiencing segfaults. (Here’s the fix for anyone interested: https://github.com/tensorflow/tensorflow/commit/78edbb6403b73d6c79bd58e23e08dc21b5c33847)