jaxlib v0.1.68 breaks CI with segfault on macOS

Description

On 2021-06-22 the scheduled nightly CI for v0.6.2 was passing, with the installed libraries listed in pass-pip-list.txt. On 2021-06-23 the CI failed with a segfault, with the installed libraries listed in fail-pip-list.txt. The relevant difference between the two lists is the versions of jax and jaxlib.

$ diff pass-pip-list.txt fail-pip-list.txt 
5a6
> appnope                0.1.2
41,42c42,43
< jax                    0.2.14
< jaxlib                 0.1.67
---
> jax                    0.2.16
> jaxlib                 0.1.68
97c98
< pyhf                   0.6.2     /home/runner/work/pyhf/pyhf/src
---
> pyhf                   0.6.2     /Users/runner/work/pyhf/pyhf/src

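The same comparison can be sketched in Python for cases where only version changes matter. This is a minimal illustration, not part of the original report: `parse_pip_list` and `changed_versions` are hypothetical helper names, and the inline strings reproduce only the lines that appear in the diff above. Only the version column is compared, so the differing pyhf source path is ignored.

```python
def parse_pip_list(text):
    """Map package name -> version from `pip list` style lines."""
    packages = {}
    for line in text.strip().splitlines():
        parts = line.split()
        # Skip blank lines and the header / separator rows that `pip list` prints.
        if len(parts) < 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        packages[parts[0]] = parts[1]
    return packages


def changed_versions(before, after):
    """Return {name: (old, new)} for packages added, removed, or re-versioned."""
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }


# Only the lines shown in the diff above; the full lists are not reproduced here.
passing = parse_pip_list("""
jax                    0.2.14
jaxlib                 0.1.67
pyhf                   0.6.2     /home/runner/work/pyhf/pyhf/src
""")
failing = parse_pip_list("""
appnope                0.1.2
jax                    0.2.16
jaxlib                 0.1.68
pyhf                   0.6.2     /Users/runner/work/pyhf/pyhf/src
""")

for name, (old, new) in changed_versions(passing, failing).items():
    print(f"{name}: {old} -> {new}")
```
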
The relevant section of the logs for the failure is the following:

src/pyhf/infer/utils.py ..                                               [  3%]
Fatal Python error: Segmentation fault

Thread 0x000070000dda9000 (most recent call first):
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/threading.py", line 306 in wait
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/threading.py", line 558 in wait
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00000001050cfdc0 (most recent call first):
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jaxlib/xla_client.py", line 67 in make_cpu_client
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/lib/xla_bridge.py", line 206 in backends
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/lib/xla_bridge.py", line 242 in get_backend
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/lib/xla_bridge.py", line 263 in get_device_backend
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/interpreters/xla.py", line 138 in _device_put_array
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/interpreters/xla.py", line 133 in device_put
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/_src/lax/lax.py", line 1596 in _device_put_raw
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/_src/numpy/lax_numpy.py", line 3025 in array
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/jax/_src/numpy/lax_numpy.py", line 3064 in asarray
  File "/Users/runner/work/pyhf/pyhf/src/pyhf/tensor/jax_backend.py", line 230 in astensor
  File "/Users/runner/work/pyhf/pyhf/src/pyhf/tensor/common.py", line 30 in _precompute
  File "/Users/runner/work/pyhf/pyhf/src/pyhf/events.py", line 36 in __call__
  File "/Users/runner/work/pyhf/pyhf/src/pyhf/__init__.py", line 147 in set_backend
  File "/Users/runner/work/pyhf/pyhf/src/pyhf/events.py", line 93 in register_wrapper
  File "<doctest pyhf.tensor.jax_backend.jax_backend.astensor[1]>", line 1 in <module>
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/doctest.py", line 1336 in __run
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/doctest.py", line 1483 in run
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/doctest.py", line 1844 in run
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/doctest.py", line 287 in runtest
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 255 in <lambda>
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/site-packages/pytest/__main__.py", line 5 in <module>
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/runpy.py", line 87 in _run_code
  File "/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/runpy.py", line 194 in _run_module_as_main
/Users/runner/work/_temp/b65896af-bc5b-4842-94da-e0fd5882e8d5.sh: line 1:  1785 Segmentation fault: 11  python -m pytest -r sx --ignore tests/benchmarks/ --ignore tests/contrib --ignore tests/test_notebooks.py
/Users/runner/hostedtoolcache/Python/3.8.10/x64/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
src/pyhf/tensor/jax_backend.py 
Error: Process completed with exit code 139.
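
The exit code 139 in the last log line is consistent with the "Segmentation fault: 11" message: POSIX shells report a process killed by signal N as status 128 + N. A small sketch of that decoding follows; `describe_exit_code` is a hypothetical helper written for illustration, not something from pyhf or GitHub Actions.

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit status (hypothetical helper)."""
    # POSIX shells report "killed by signal N" as 128 + N.
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"killed by unknown signal {code - 128}"
    return f"exited with status {code}"


print(describe_exit_code(139))
```

On Linux and macOS, signal 11 is SIGSEGV, so the call reports that the process was killed by SIGSEGV.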

Both jax and jaxlib had releases on 2021-06-23.

@lukasheinrich @kratsg we’ll need to follow up with the JAX team.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

1 reaction
jpivarski commented, Jun 25, 2021

A lot of tests are still running (Windows takes a long time to compile), but all of the MacOS ones passed on the first try, which by Poisson statistics or whatever means jaxlib is almost certainly to blame.

Okay, I went and tried computing it; I think the average number of MacOS failures over the last few days has been 2.5 per run of 6, which has 92% of the distribution above 0 in a run of 6. Like, two sigma.

>>> import scipy.stats
>>> scipy.stats.poisson.pmf(0, 2.5)
0.0820849986238988
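
The arithmetic behind that estimate can be checked with the standard library alone. The sketch below reuses the numbers from the comment (an average of 2.5 failures per run of 6 builds). The binomial variant is an added assumption, not something the commenter computed: it treats the 6 builds as independent, each failing with probability 2.5/6, which respects the fact that at most 6 builds can fail.

```python
import math

mean_failures = 2.5   # average macOS failures per run, from the comment above
builds_per_run = 6    # macOS builds per run, from the comment above

# Poisson model: P(0 failures) = exp(-mean), the value scipy's pmf(0, 2.5) returns.
p_zero_poisson = math.exp(-mean_failures)

# Binomial alternative (assumption: independent builds, equal failure probability).
p_fail = mean_failures / builds_per_run
p_zero_binomial = (1 - p_fail) ** builds_per_run

print(f"Poisson  P(0 failures) = {p_zero_poisson:.4f}")
print(f"Binomial P(0 failures) = {p_zero_binomial:.4f}")
```

Under either model, a clean run of all 6 builds is unlikely if the failure rate had stayed the same, which is the point the comment makes.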

Has anything about this been reported in the JAX project?

1 reaction
jpivarski commented, Jun 25, 2021

We’ve only been seeing it in MacOS (not Linux/Ubuntu and not Windows), but it’s not too surprising that a segfault is limited to only one platform. It’s intermittent, too, so it probably has something to do with how uninitialized data happens to be filled from the previous step, which is very unpredictable but can be strongly correlated with platform. (I.e. the Ubuntu builds could be failing with a much smaller probability, or maybe some totally unrelated thing in the OS prevents it with certainty.)

I’ll try pinning jaxlib<0.1.68 to see what happens in Awkward Array’s tests. The probability of segfaulting is such that 2 or 3 of the 6 MacOS builds typically fail, so if it makes it through a round with 0 segfaults, that’s good evidence that it’s totally related to the new jaxlib. I’ll post results in about 20 minutes.
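
A pin such as jaxlib<0.1.68 is normally expressed as a requirement specifier given to pip. To confirm from inside a test session that the pin actually took effect, something like the following could be used. It is a rough sketch under stated assumptions: `parse_version`, `satisfies_pin`, and `check_installed` are hypothetical helpers, and the parser only handles plain dotted release numbers like the ones in this issue, not pre-release or local version suffixes.

```python
from importlib import metadata


def parse_version(version):
    """Turn a plain dotted release string such as '0.1.68' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pin(installed, upper_bound="0.1.68"):
    """True if `installed` is strictly below `upper_bound` (i.e. jaxlib<0.1.68)."""
    # Tuples compare numerically, so 0.1.9 < 0.1.68 (a string comparison would not).
    return parse_version(installed) < parse_version(upper_bound)


def check_installed(package="jaxlib"):
    """Describe whether the installed package respects the pin."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed"
    try:
        ok = satisfies_pin(installed)
    except ValueError:
        return f"{package} {installed}: version format not handled by this sketch"
    return f"{package} {installed}: {'within' if ok else 'outside'} the <0.1.68 pin"


print(check_installed())
```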
