jaxlib v0.1.68 breaks CI with segfault on macOS
Description
On 2021-06-22 the scheduled nightly CI for v0.6.2 passed, with the installed libraries recorded in pass-pip-list.txt. On 2021-06-23 the CI failed with a segfault, with the installed libraries recorded in fail-pip-list.txt. The relevant difference between the two is the versions of jax and jaxlib.
$ diff pass-pip-list.txt fail-pip-list.txt
5a6
> appnope 0.1.2
41,42c42,43
< jax 0.2.14
< jaxlib 0.1.67
---
> jax 0.2.16
> jaxlib 0.1.68
97c98
< pyhf 0.6.2 /home/runner/work/pyhf/pyhf/src
---
> pyhf 0.6.2 /Users/runner/work/pyhf/pyhf/src
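The same comparison can be sketched in Python when only the version changes matter. This is a minimal sketch, not part of the CI: the helper names `parse_pip_list` and `changed_packages` are hypothetical, and it assumes each line has the form `name version [location]` as in the diff above.

```python
def parse_pip_list(text):
    """Map package name -> version from 'name version [location]' lines."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            packages[parts[0]] = parts[1]
    return packages


def changed_packages(passing, failing):
    """Return {name: (old_version, new_version)}; None marks a missing package."""
    names = sorted(set(passing) | set(failing))
    return {
        name: (passing.get(name), failing.get(name))
        for name in names
        if passing.get(name) != failing.get(name)
    }


# Only the lines that appear in the diff above are reproduced here.
passing = parse_pip_list(
    "jax 0.2.14\njaxlib 0.1.67\npyhf 0.6.2 /home/runner/work/pyhf/pyhf/src"
)
failing = parse_pip_list(
    "appnope 0.1.2\njax 0.2.16\njaxlib 0.1.68\n"
    "pyhf 0.6.2 /Users/runner/work/pyhf/pyhf/src"
)

for name, (old, new) in changed_packages(passing, failing).items():
    print(f"{name}: {old} -> {new}")
```

Because the install location is ignored, pyhf does not show up as changed. That leaves appnope (new on macOS) plus the jax and jaxlib bumps.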
The relevant section of the logs for the failure is the following:
src/pyhf/infer/utils.py .. [ 3%]
Fatal Python error: Segmentation fault
Thread 0x000070000dda9000 (most recent call first):
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/threading.py", line 306 in wait
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/threading.py", line 558 in wait
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/threading.py", line 890 in _bootstrap
Current thread 0x00000001050cfdc0 (most recent call first):
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jaxlib/xla_client.py", line 67 in make_cpu_client
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/lib/xla_bridge.py", line 206 in backends
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/lib/xla_bridge.py", line 242 in get_backend
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/lib/xla_bridge.py", line 263 in get_device_backend
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/interpreters/xla.py", line 138 in _device_put_array
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/interpreters/xla.py", line 133 in device_put
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/_src/lax/lax.py", line 1596 in _device_put_raw
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/_src/numpy/lax_numpy.py", line 3025 in array
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/_src/numpy/lax_numpy.py", line 3064 in asarray
File "/Users/runner/work/pyhf/pyhf/src/pyhf/tensor/jax_backend.py", line 230 in astensor
File "/Users/runner/work/pyhf/pyhf/src/pyhf/tensor/common.py", line 30 in _precompute
File "/Users/runner/work/pyhf/pyhf/src/pyhf/events.py", line 36 in __call__
File "/Users/runner/work/pyhf/pyhf/src/pyhf/__init__.py", line 147 in set_backend
File "/Users/runner/work/pyhf/pyhf/src/pyhf/events.py", line 93 in register_wrapper
File "<doctest pyhf.tensor.jax_backend.jax_backend.astensor[1]>", line 1 in <module>
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/doctest.py", line 1336 in __run
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/doctest.py", line 1483 in run
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/doctest.py", line 1844 in run
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/doctest.py", line 287 in runtest
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 255 in <lambda>
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pytest/__main__.py", line 5 in <module>
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/runpy.py", line 87 in _run_code
File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/runpy.py", line 194 in _run_module_as_main
/Users/runner/work/_temp/b65896af-bc5b-4842-94da-e0fd5882e8d5.sh: line 1: 1785 Segmentation fault: 11 python -m pytest -r sx --ignore tests/benchmarks/ --ignore tests/contrib --ignore tests/test_notebooks.py
/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
src/pyhf/tensor/jax_backend.py
Error: Process completed with exit code 139.
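The exit code 139 in the last line is consistent with the "Segmentation fault: 11" message: shells conventionally report a process killed by signal N as exit status 128 + N. A minimal sketch of decoding it follows; the helper name `describe_exit_code` is hypothetical.

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit status.

    By shell convention, a status above 128 means the process was
    terminated by signal number (status - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"


print(describe_exit_code(139))
```

Here 139 - 128 = 11, which is SIGSEGV, matching the "Fatal Python error: Segmentation fault" line at the top of the log.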
Both jax and jaxlib had releases on 2021-06-23:
@lukasheinrich @kratsg we’ll need to follow up with the JAX team.

A lot of tests are still running (Windows takes a long time to compile), but all of the MacOS ones passed on the first try, which by Poisson statistics or whatever means jaxlib is almost certainly to blame:
Okay, I went and tried computing it; I think the average number of MacOS failures over the last few days has been 2.5 per run of 6, which has 92% of the distribution above 0 in a run of 6. Like, two sigma.
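The 92% figure can be reproduced with a quick calculation. This is a sketch under the stated assumption that the number of macOS failures per run is Poisson-distributed with mean 2.5. The binomial variant (each of the 6 builds failing independently with probability 2.5/6) is an extra assumption added here for comparison, not something from the thread.

```python
import math


def poisson_prob_zero(mean_failures):
    """P(N = 0) for N ~ Poisson(mean_failures)."""
    return math.exp(-mean_failures)


def binomial_prob_zero(n_builds, mean_failures):
    """P(no failures) if each build fails independently with p = mean / n."""
    p_fail = mean_failures / n_builds
    return (1.0 - p_fail) ** n_builds


mean_failures = 2.5  # average macOS failures per run of 6, as quoted above
n_builds = 6

p_zero = poisson_prob_zero(mean_failures)
print(f"Poisson:  P(0 failures) = {p_zero:.3f}, P(>0) = {1 - p_zero:.3f}")

p_zero_binom = binomial_prob_zero(n_builds, mean_failures)
print(f"Binomial: P(0 failures) = {p_zero_binom:.3f}, P(>0) = {1 - p_zero_binom:.3f}")
```

Under the Poisson assumption, a clean run happens by chance about 8% of the time, which gives the 92% quoted above. The binomial model makes a clean run by chance somewhat less likely still.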
Has anything about this been reported in the JAX project?
We’ve only been seeing it in MacOS (not Linux/Ubuntu and not Windows), but it’s not too surprising that a segfault is limited to only one platform. It’s intermittent, too, so it probably has something to do with how uninitialized data happens to be filled from the previous step, which is very unpredictable but can be strongly correlated with platform. (I.e. the Ubuntu builds could be failing with a much smaller probability, or maybe some totally unrelated thing in the OS prevents it with certainty.)
I’ll try pinning jaxlib<0.1.68 to see what happens in Awkward Array’s tests. The probability of segfaulting is such that 2 or 3 of the 6 MacOS builds typically fail, so if it makes it through a round with 0 segfaults, that’s good evidence that it’s totally related to the new jaxlib. I’ll post results in about 20 minutes.
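To confirm that a given environment actually honors the pin, a check along these lines could be used. It is a minimal sketch: `parse_version` and `satisfies_pin` are hypothetical helpers, and the parsing assumes plain dotted-integer versions such as `0.1.67` (no pre-release or local suffixes).

```python
def parse_version(version):
    """Turn '0.1.68' into (0, 1, 68); assumes plain dotted integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pin(version, upper_bound="0.1.68"):
    """True if version < upper_bound, i.e. it satisfies 'jaxlib<0.1.68'."""
    return parse_version(version) < parse_version(upper_bound)


for candidate in ("0.1.67", "0.1.68"):
    status = "satisfies" if satisfies_pin(candidate) else "violates"
    print(f"jaxlib {candidate} {status} the pin jaxlib<0.1.68")
```

On Python 3.8 or later, the installed version string can be obtained with `importlib.metadata.version("jaxlib")` and passed to `satisfies_pin`.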