jaxlib v0.1.68 causing nondeterministic segfault for only macOS on GitHub Actions and Azure Pipelines servers
Hi. This is a bit of a strange (possible) bug report that we’ve held off on for a few days until we could try to get better reporting information. Both pyhf and awkward have been seeing segfaults on GitHub Actions jobs (for pyhf) and Azure Pipelines jobs (for awkward) since the release of jaxlib v0.1.68 that happen only for v0.1.68 (c.f. https://github.com/scikit-hep/pyhf/issues/1501)
$ pip list | grep jax
jax 0.2.16
jaxlib 0.1.68
and go away if we downgrade to jaxlib<0.1.68 (c.f. https://github.com/scikit-hep/awkward-1.0/pull/963 and https://github.com/scikit-hep/pyhf/pull/1502).
$ pip list | grep jax
jax 0.2.16
jaxlib 0.1.67
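As a stopgap alongside the pin, the installed version could also be checked at runtime. This is only a sketch: the helper names `parse_version`, `is_affected` and `installed_jaxlib_version` are made up for illustration, the parsing assumes plain `X.Y.Z` release strings, and `importlib.metadata` assumes Python 3.8+ (the GHA jobs shown in this report run Python 3.7, where it is not in the standard library).

```python
from importlib import metadata  # standard library in Python 3.8+


def parse_version(version):
    """Turn a plain release string like '0.1.68' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_affected(jaxlib_version, bad_version="0.1.68"):
    """True if the given jaxlib version is the release where we see the segfaults."""
    return parse_version(jaxlib_version) == parse_version(bad_version)


def installed_jaxlib_version():
    """Return the installed jaxlib version string, or None if jaxlib is not installed."""
    try:
        return metadata.version("jaxlib")
    except metadata.PackageNotFoundError:
        return None


installed = installed_jaxlib_version()
if installed is not None and is_affected(installed):
    print(f"jaxlib {installed} is the release with the reported macOS CI segfaults")
```

With the two environments listed above, `is_affected("0.1.68")` is true and `is_affected("0.1.67")` is false.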
The bizarre part is that I am unable to replicate these segfaults on a MacBook Air that I’ve borrowed to debug this.
Minimal Failing Examples on GitHub Actions
The pyhf test suite has been segfaulting during runs as documented in https://github.com/scikit-hep/pyhf/issues/1501. To look at the environment in which this was happening I connected to a tmate session on the GHA servers using the mxschmitt/action-tmate@v3 GHA and I was able to replicate the segfault behavior on GHA with the following examples using just pure JAX
# debug_32b.py
import jax # noqa: F401
import jax.numpy as jnp
print(jnp.asarray([-2, -1], dtype=jnp.float32))
print(jnp.asarray([-2, -1], dtype=jnp.float64))
# debug_64b.py
import jax # noqa: F401
from jax.config import config
config.update('jax_enable_x64', True)
import jax.numpy as jnp
# 32b first
jnp.asarray([-2, -1])
# then switch to 64b
jnp.asarray([-2, -1], dtype=jnp.float64)
and the following commands (with the bash-3.2 removed from before the $ for formatting) using both the debug_32b.py
$ python debug_32b.py
Segmentation fault: 11
$ python debug_32b.py
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[-2. -1.]
/Users/runner/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/jax/_src/numpy/lax_numpy.py:3062: UserWarning: Explicitly requested dtype <class 'jax._src.numpy.lax_numpy.float64'> requested in asarray is not available, and will be truncated to dtype float32. To enable more dtypes, set the jax_enable_x64 configuration option or the JAX_ENABLE_X64 shell environment variable. See https://github.com/google/jax#current-gotchas for more.
lax._check_user_dtype_supported(dtype, "asarray")
[-2. -1.]
$ python debug_32b.py
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[-2. -1.]
/Users/runner/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/jax/_src/numpy/lax_numpy.py:3062: UserWarning: Explicitly requested dtype <class 'jax._src.numpy.lax_numpy.float64'> requested in asarray is not available, and will be truncated to dtype float32. To enable more dtypes, set the jax_enable_x64 configuration option or the JAX_ENABLE_X64 shell environment variable. See https://github.com/google/jax#current-gotchas for more.
lax._check_user_dtype_supported(dtype, "asarray")
[-2. -1.]
$ python debug_32b.py
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[-2. -1.]
/Users/runner/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/jax/_src/numpy/lax_numpy.py:3062: UserWarning: Explicitly requested dtype <class 'jax._src.numpy.lax_numpy.float64'> requested in asarray is not available, and will be truncated to dtype float32. To enable more dtypes, set the jax_enable_x64 configuration option or the JAX_ENABLE_X64 shell environment variable. See https://github.com/google/jax#current-gotchas for more.
lax._check_user_dtype_supported(dtype, "asarray")
[-2. -1.]
$ python debug_32b.py
Segmentation fault: 11
and the debug_64b.py
$ python debug_64b.py
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
$ python debug_64b.py
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
$ python debug_64b.py
Segmentation fault: 11
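Because the crash is intermittent, a small harness could be used to estimate how often a script segfaults rather than rerunning it by hand. The following is a sketch and not what was run on the GHA servers: the function name `count_segfaults` is made up, and it relies on the POSIX convention that `subprocess` reports a process killed by a signal with a negative return code.

```python
import signal
import subprocess
import sys


def count_segfaults(cmd, runs=10):
    """Run cmd `runs` times and count how many runs were killed by SIGSEGV."""
    segfaults = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX, a negative return code is the number of the signal
        # that terminated the child process.
        if result.returncode == -signal.SIGSEGV:
            segfaults += 1
    return segfaults


# Example (assumes debug_32b.py is in the current directory):
#   n = count_segfaults([sys.executable, "debug_32b.py"], runs=20)
#   print(f"{n}/20 runs segfaulted")
```

Running each attempt in a fresh subprocess matters here, since a segfault takes down the whole interpreter and could not be caught with a try/except in the same process.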
For the GHA server the env is (the same happens on the Python 3.8 jobs)
$ python --version --version
Python 3.7.10 (default, Feb 16 2021, 11:44:40)
[Clang 11.0.0 (clang-1100.0.33.17)]
$ printenv
GITHUB_JOB=test
GITHUB_EVENT_PATH=/Users/runner/work/_temp/_github_workflow/event.json
RUNNER_OS=macOS
XCODE_12_DEVELOPER_DIR=/Applications/Xcode_12.4.app/Contents/Developer
ANDROID_HOME=/Users/runner/Library/Android/sdk
GITHUB_BASE_REF=
NVM_CD_FLAGS=
CHROMEWEBDRIVER=/usr/local/Caskroom/chromedriver/91.0.4472.101
SHELL=/bin/bash
TERM=screen-256color
PIPX_BIN_DIR=/usr/local/opt/pipx_bin
GITHUB_REPOSITORY_OWNER=scikit-hep
INPUT_SUDO=true
TMPDIR=/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/
GITHUB_ACTIONS=true
GITHUB_RUN_NUMBER=7368
ANDROID_SDK_ROOT=/Users/runner/Library/Android/sdk
JAVA_HOME_8_X64=/Users/runner/hostedtoolcache/Java_Adopt_jdk/8.0.292-10/x64/Contents/Home
RCT_NO_LAUNCH_PACKAGER=1
RUNNER_WORKSPACE=/Users/runner/work/pyhf
NUNIT_BASE_PATH=/Library/Developer/nunit
RUNNER_PERFLOG=/usr/local/opt/runner/perflog
GITHUB_REF=refs/heads/fix/test-jax-version-that-breaks-ci
GITHUB_WORKFLOW=CI/CD
LC_ALL=en_US.UTF-8
NUNIT3_PATH=/Library/Developer/nunit/3.6.0
JAVA_HOME_11_X64=/Users/runner/hostedtoolcache/Java_Adopt_jdk/11.0.11-9/x64/Contents/Home
RUNNER_TOOL_CACHE=/Users/runner/hostedtoolcache
GITHUB_ACTION_REPOSITORY=mxschmitt/action-tmate
JAVA_HOME_14_X64=/Users/runner/hostedtoolcache/Java_Adopt_jdk/14.0.2-12/x64/Contents/Home
NVM_DIR=/Users/runner/.nvm
USER=runner
GITHUB_API_URL=https://api.github.com
GITHUB_EVENT_NAME=push
GITHUB_SHA=2e371805064fc961c95106c4098702b3696827c3
XCODE_10_DEVELOPER_DIR=/Applications/Xcode_10.3.app/Contents/Developer
RUNNER_TEMP=/Users/runner/work/_temp
pythonLocation=/Users/runner/hostedtoolcache/Python/3.7.10/x64
ANDROID_NDK_ROOT=/Users/runner/Library/Android/sdk/ndk-bundle
ANDROID_NDK_LATEST_HOME=/Users/runner/Library/Android/sdk/ndk/22.1.7171670
ImageVersion=20210620.1
SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.I24a5PqIL3/Listeners
GITHUB_SERVER_URL=https://github.com
HOMEBREW_NO_AUTO_UPDATE=1
__CF_USER_TEXT_ENCODING=0x1F5:0:0
AGENT_TOOLSDIRECTORY=/Users/runner/hostedtoolcache
GITHUB_HEAD_REF=
GITHUB_GRAPHQL_URL=https://api.github.com/graphql
TMUX=/tmp/tmate.sock,1968,0
PATH=/Users/runner/hostedtoolcache/Python/3.7.10/x64/bin:/Users/runner/hostedtoolcache/Python/3.7.10/x64:/usr/local/opt/pipx_bin:/Users/runner/.cargo/bin:/usr/local/lib/ruby/gems/2.7.0/bin:/usr/local/opt/ruby@2.7/bin:/usr/local/opt/curl/bin:/usr/local/bin:/usr/local/sbin:/Users/runner/bin:/Users/runner/.yarn/bin:/Users/runner/Library/Android/sdk/tools:/Users/runner/Library/Android/sdk/platform-tools:/Users/runner/Library/Android/sdk/ndk-bundle:/Library/Frameworks/Mono.framework/Versions/Current/Commands:/usr/bin:/bin:/usr/sbin:/sbin:/Users/runner/.dotnet/tools:/Users/runner/.ghcup/bin:/Users/runner/hostedtoolcache/stack/2.7.1/x64
INPUT_LIMIT-ACCESS-TO-ACTOR=false
GITHUB_RETENTION_DAYS=90
PERFLOG_LOCATION_SETTING=RUNNER_PERFLOG
CONDA=/usr/local/miniconda
DOTNET_ROOT=/Users/runner/.dotnet
EDGEWEBDRIVER=/usr/local/share/edge_driver
PWD=/Users/runner/work/pyhf/pyhf
VM_ASSETS=/usr/local/opt/runner/scripts
JAVA_HOME=/Users/runner/hostedtoolcache/Java_Adopt_jdk/8.0.292-10/x64/Contents/Home
JAVA_HOME_12_X64=/Users/runner/hostedtoolcache/Java_Adopt_jdk/12.0.2-10.3/x64/Contents/Home
VCPKG_INSTALLATION_ROOT=/usr/local/share/vcpkg
LANG=en_US.UTF-8
ImageOS=macos1015
TMUX_PANE=%0
XPC_FLAGS=0x0
PIPX_HOME=/usr/local/opt/pipx
GECKOWEBDRIVER=/usr/local/opt/geckodriver/bin
GITHUB_ACTOR=matthewfeickert
XPC_SERVICE_NAME=0
HOME=/Users/runner
SHLVL=4
ACTIONS_RUNTIME_TOKEN=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Ik9ta3lYbmJnM05RTE1nMGZMaTBSNnJxdzlxdyJ9.eyJuYW1laWQiOiJkZGRkZGRkZC1kZGRkLWRkZGQtZGRkZC1kZGRkZGRkZGRkZGQiLCJzY3AiOiJBY3Rpb25zLkdlbmVyaWNS
ZWFkOjAwMDAwMDAwLTAwMDAtMDAwMC0wMDAwLTAwMDAwMDAwMDAwMCBBY3Rpb25zLlVwbG9hZEFydGlmYWN0czowMDAwMDAwMC0wMDAwLTAwMDAtMDAwMC0wMDAwMDAwMDAwMDAvMTpCdWlsZC9CdWlsZC8yMzU5MSBMb2NhdGlvblNlcnZpY2UuQ29ubmVjdCBSZWFkQW5
kVXBkYXRlQnVpbGRCeVVyaTowMDAwMDAwMC0wMDAwLTAwMDAtMDAwMC0wMDAwMDAwMDAwMDAvMTpCdWlsZC9CdWlsZC8yMzU5MSIsIklkZW50aXR5VHlwZUNsYWltIjoiU3lzdGVtOlNlcnZpY2VJZGVudGl0eSIsImh0dHA6Ly9zY2hlbWFzLnhtbHNvYXAub3JnL3dzLz
IwMDUvMDUvaWRlbnRpdHkvY2xhaW1zL3NpZCI6IkRERERERERELUREREQtRERERC1ERERELURERERERERERERERCIsImh0dHA6Ly9zY2hlbWFzLm1pY3Jvc29mdC5jb20vd3MvMjAwOC8wNi9pZGVudGl0eS9jbGFpbXMvcHJpbWFyeXNpZCI6ImRkZGRkZGRkLWRkZGQtZ
GRkZC1kZGRkLWRkZGRkZGRkZGRkZCIsImF1aSI6ImE1YTFiNzdhLWFhYTktNDFiNi05ZTRjLWQ2OWI4NzRiMmRkNCIsInNpZCI6IjU0OGJjZjNlLWU3MWYtNDI2YS1iYTY0LTNkNmZjNjVhM2JkYSIsImFjIjoiW3tcIlNjb3BlXCI6XCJyZWZzL2hlYWRzL2ZpeC90ZXN0
LWpheC12ZXJzaW9uLXRoYXQtYnJlYWtzLWNpXCIsXCJQZXJtaXNzaW9uXCI6M30se1wiU2NvcGVcIjpcInJlZnMvaGVhZHMvbWFzdGVyXCIsXCJQZXJtaXNzaW9uXCI6MX1dIiwib3JjaGlkIjoiMDhhMmQ1NDEtNTU5NC00NjBlLWFlOGQtMjM3YjUyZTYyYjY1LnRlc3Q
ubWFjb3MtbGF0ZXN0XzNfNyIsImlzcyI6InZzdG9rZW4uYWN0aW9ucy5naXRodWJ1c2VyY29udGVudC5jb20iLCJhdWQiOiJ2c3Rva2VuLmFjdGlvbnMuZ2l0aHVidXNlcmNvbnRlbnQuY29tfHZzbzo1YmY5NjQ5Zi01MjlhLTRhYmMtODAwYS1iNThhMDNjZDNlM2IiLC
JuYmYiOjE2MjQ5MTIzOTEsImV4cCI6MTYyNDkzNTE5MX0.o0oeOn2M2Dbx8-K3yhe4JGA7k9KmR9KBoVAujCk29uptx7HOPfB1kba1l4Ofylm1DeKuB0xfMF5Y8ttibvDTgH2HitCC3BMdL64LZ99IUNnjngkuUsGuQsFI3E3uwT3SF6OpQcaeLjtCV3Qx2iUGkPsWM8Tpt
XD0TH4IXw5NJsbx3rKHHC2aSM6384Im-Nu965w_7539XkaIyLkg8MFK9MTIBr0O0HfRxJqvvareP7ufdqDnvY9EVupoVCdSEs3Xe5fuYW_GJvsKHImbsGoRTOgTFgiwOFxYIiMvcjyU1PDjg3ttjBF0JiMmReypLgSsQqUD-BrPIvjKuHYuzQplTg
RUNNER_TRACKING_ID=github_f9075cf8-002b-4c98-a7e2-61bcc0d94891
ANDROID_NDK_18R_PATH=/Users/runner/Library/Android/sdk/ndk/18.1.5063045
GITHUB_WORKSPACE=/Users/runner/work/pyhf/pyhf
CI=true
GITHUB_ACTION_REF=v3
GITHUB_RUN_ID=980389940
ACTIONS_RUNTIME_URL=https://pipelines.actions.githubusercontent.com/7egiF0eguRHanWqGVl5G5J1mX1k4YmsTgFLGKvP1guMOJIVNqS/
LOGNAME=runner
ACTIONS_CACHE_URL=https://artifactcache.actions.githubusercontent.com/7egiF0eguRHanWqGVl5G5J1mX1k4YmsTgFLGKvP1guMOJIVNqS/
GITHUB_ENV=/Users/runner/work/_temp/_runner_file_commands/set_env_af7b2b08-a369-43cf-87cc-23e9c7f65cbc
LC_CTYPE=en_US.UTF-8
HOMEBREW_CLEANUP_PERIODIC_FULL_DAYS=3650
JAVA_HOME_13_X64=/Users/runner/hostedtoolcache/Java_Adopt_jdk/13.0.2-8.1/x64/Contents/Home
HOMEBREW_CASK_OPTS=--no-quarantine
POWERSHELL_DISTRIBUTION_CHANNEL=GitHub-Actions-macos1015
ANDROID_NDK_HOME=/Users/runner/Library/Android/sdk/ndk-bundle
BOOTSTRAP_HASKELL_NONINTERACTIVE=1
XCODE_11_DEVELOPER_DIR=/Applications/Xcode_11.7.app/Contents/Developer
GITHUB_REPOSITORY=scikit-hep/pyhf
GITHUB_PATH=/Users/runner/work/_temp/_runner_file_commands/add_path_af7b2b08-a369-43cf-87cc-23e9c7f65cbc
GITHUB_ACTION=mxschmittaction-tmate
DOTNET_MULTILEVEL_LOOKUP=0
_=/usr/bin/printenv
However, I am unable to replicate this at all on the MacBook Air
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.13.6
BuildVersion: 17G14042
$ python --version --version
Python 3.8.10 (default, Jun 27 2021, 18:38:01)
[Clang 10.0.0 (clang-1000.10.44.4)]
$ printenv
SSH_AGENT_PID=533
TERM_PROGRAM=iTerm.app
PYENV_ROOT=/Users/cerylinae/.pyenv
TERM=xterm-256color
SHELL=/bin/bash
TMPDIR=/var/folders/rx/t5jm47z56bxfxmbp2qs6fsj80000gn/T/
Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.dWj7SOkSaA/Render
TERM_PROGRAM_VERSION=3.3.12
OLDPWD=/Users/cerylinae/Code
TERM_SESSION_ID=w0t0p0:FEA1A898-9304-451B-9F5E-765940B67423
PYENV_VERSION=pyhf-debug
USER=cerylinae
SSH_AUTH_SOCK=/var/folders/rx/t5jm47z56bxfxmbp2qs6fsj80000gn/T//ssh-g3V3yN8vZC0o/agent.532
__CF_USER_TEXT_ENCODING=0x0:0:0
PYENV_VIRTUALENV_INIT=1
VIRTUAL_ENV=/Users/cerylinae/.pyenv/versions/3.8.10/envs/pyhf-debug
PYENV_VIRTUAL_ENV=/Users/cerylinae/.pyenv/versions/3.8.10/envs/pyhf-debug
PATH=/Users/cerylinae/.pyenv/plugins/pyenv-virtualenv/shims:/Users/cerylinae/.pyenv/shims:/Users/cerylinae/.pyenv/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin
PWD=/Users/cerylinae/Code/pyhf
LANG=en_US.UTF-8
ITERM_PROFILE=Default
_OLD_VIRTUAL_PS1=\h:\W \u\$
XPC_FLAGS=0x0
PS1=(pyhf-debug) \h:\W \u\$
XPC_SERVICE_NAME=0
PYENV_SHELL=bash
SHLVL=1
HOME=/Users/cerylinae
COLORFGBG=7;0
LC_TERMINAL_VERSION=3.3.12
ITERM_SESSION_ID=w0t0p0:FEA1A898-9304-451B-9F5E-765940B67423
LOGNAME=cerylinae
LC_TERMINAL=iTerm2
DISPLAY=/private/tmp/com.apple.launchd.39ujEqef0g/org.macosforge.xquartz:0
PYENV_ACTIVATE_SHELL=1
COLORTERM=truecolor
_=/usr/bin/printenv
We thought that we would report this to the JAX team as we are unable to replicate this behavior with older versions of jaxlib, but as we’re unable to replicate this locally for jaxlib v0.1.68, if you’d like us to open complementary issues with the GitHub Actions virtual environments team we’re happy to do so as well.
We’re happy to do whatever we can to try to help debug this, and if it is of any help there is a branch of the pyhf repo (fix/test-jax-version-that-breaks-ci) that has the examples shown here on it.
Top GitHub Comments
It looks like this is related to the alignment of an AVX instruction:
I think vmovaps requires 32-byte alignment, but 0x...b0 is only 16-byte aligned.
This problem only appears to reproduce under the newer TFRT runtime. If you use jax from head, you can set JAX_CPU_BACKEND_VARIANT=stream_executor which works around the problem.
I’ll keep debugging, this is most odd.
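The alignment point above comes down to simple arithmetic on the low bits of the address. A minimal sketch, with the caveat that the address below is hypothetical: only the trailing `b0` comes from the comment, the leading digits are made up, and `is_aligned` is an illustrative helper rather than anything from jaxlib.

```python
def is_aligned(address, alignment):
    """True if address is a multiple of alignment (alignment is a power of two)."""
    return address % alignment == 0


# Hypothetical address: only the final byte 0xb0 is taken from the comment above.
address = 0x7F8A12B0

print(is_aligned(address, 16))  # True: 0xb0 = 176 = 11 * 16
print(is_aligned(address, 32))  # False: 176 % 32 = 16
```

Any address ending in `b0` behaves the same way, since only the low five bits decide 32-byte alignment. An aligned 256-bit load or store to such an address would therefore fault, which would be consistent with a crash that depends on where the allocator happens to place a buffer.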
Now that the fix is in, I’m building a new jaxlib release.