
Performance issue on macOS arm64 (M1) when installing from wheels (2x libopenblas)


This is a follow-up to gh-14688. That issue was originally about a kernel panic (fixed in macOS 12.0.1); after that fix, the same reproducer showed severe performance issues. This issue is about those performance issues. Note that while the reproducer is the same, it is not clear whether the kernel panic and the performance issues share a root cause.

Issue reproducer

A reproducer (warning: do NOT run on macOS 11.x, it will crash the OS):

from time import perf_counter
import numpy as np
from scipy.sparse.linalg import eigsh


n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T

for i in range(10):
    print("running eigsh...")
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")

Running scipy.test() or scipy.linalg.test() will also show a significant performance impact.

Performance impact

In situations where we hit the performance problem, the above code will show:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.062 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.888 s
...

And if we don’t hit that problem:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
...

So a ~50x slowdown for this particular example.

There is in general an impact on functions that use BLAS/LAPACK. In one test on a single build config (results may vary), the total time taken by scipy.test() went up by about 30%: 311 s with default settings vs. 234 s with OPENBLAS_NUM_THREADS=1 (https://github.com/scipy/scipy/issues/14688#issuecomment-968748706). The single-threaded timing is similar to running the test suite on a scipy install that doesn't show the problem at all: ~240 s seems to be the expected time on arm64 macOS, independent of the threading setting, because the test arrays are always small. Important: ensure pytest-xdist is not installed when looking at the time taken by the test suite (see gh-14425 for why). A rough way to reproduce this comparison is sketched below.
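
A minimal sketch (not from the issue itself) of timing the test suite with and without OPENBLAS_NUM_THREADS=1; it assumes pytest is installed and pytest-xdist is not. The variable has to be set before the process loads libopenblas, hence a fresh subprocess per configuration:

import os
import subprocess
import sys
from time import perf_counter

# Compare the default threading behavior against OPENBLAS_NUM_THREADS=1.
for num_threads in (None, "1"):
    env = os.environ.copy()
    if num_threads is not None:
        env["OPENBLAS_NUM_THREADS"] = num_threads
    tic = perf_counter()
    subprocess.run(
        [sys.executable, "-c", "import scipy; scipy.test()"],
        env=env,
        check=False,  # the test suite may report unrelated failures
    )
    toc = perf_counter()
    print(f"OPENBLAS_NUM_THREADS={num_threads!r}: {toc - tic:.0f} s total")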

When the problem occurs

The discussion in gh-14688 showed that this problem gets hit when two copies of libopenblas get loaded. The following configurations have shown the problem so far:

  • Installing both numpy and scipy from a wheel (e.g., numpy 1.21.4 from PyPI and the latest 1.8.0.dev0 wheel from https://anaconda.org/scipy-wheels-nightly/scipy/)
  • Installing numpy 1.21.4 from PyPI and installing scipy locally when built against conda-forge’s openblas.

These configurations did not show a problem:

  • Installing numpy 1.21.4 from PyPI and installing scipy locally when built against Homebrew’s openblas.
  • Any situation where only a single libopenblas is loaded.

It is unclear right now what the exact root cause is. The setup using conda-forge's openblas is very similar to the one using Homebrew's openblas, yet only one of them triggers the issue. The most important configuration is installing both NumPy and SciPy from wheels, since that is what the vast majority of pip/PyPI users will get.

A difference between conda-forge and Homebrew that may be relevant is that the former uses @rpath and the latter a hardcoded path to load libopenblas:

% # conda-forge
% otool -L _fblas.cpython-39-darwin.so
_fblas.cpython-39-darwin.so:
	@rpath/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)

% # Homebrew
% otool -L  /opt/homebrew/lib/python3.9/site-packages/scipy/linalg/_fblas.cpython-*-darwin.so
/opt/homebrew/lib/python3.9/site-packages/scipy/linalg/_fblas.cpython-39-darwin.so:
	/opt/homebrew/opt/openblas/lib/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/opt/homebrew/opt/gcc/lib/gcc/11/libgfortran.5.dylib (compatibility version 6.0.0, current version 6.0.0)
	/opt/homebrew/opt/gcc/lib/gcc/11/libgcc_s.1.1.dylib (compatibility version 1.0.0, current version 1.1.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)

That may not be the only relevant difference; e.g., the compilers used to build libopenblas and scipy were not the same. Also, libopenblas can be built with either pthreads or OpenMP: the numpy and scipy wheels use pthreads, while conda-forge and Homebrew both use OpenMP.

To check if two libopenblas libraries get loaded, use:

❯ python -m threadpoolctl -i scipy.linalg
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ogrisel/mambaforge/envs/tmp/lib/python3.9/site-packages/numpy/.dylibs/libopenblas64_.0.dylib",
    "version": "0.3.18",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  },
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ogrisel/mambaforge/envs/tmp/lib/python3.9/site-packages/scipy/.dylibs/libopenblas.0.dylib",
    "version": "0.3.17",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  }
]
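
The same check can be done programmatically. A minimal sketch using threadpoolctl's Python API (threadpool_info() is the function behind the command-line output above):

from threadpoolctl import threadpool_info

import numpy          # noqa: F401  (loads NumPy's bundled libopenblas, if any)
import scipy.linalg   # noqa: F401  (loads SciPy's bundled libopenblas, if any)

# Keep only the OpenBLAS entries from all detected threadpool-backed libraries.
openblas_libs = [
    info for info in threadpool_info()
    if info.get("internal_api") == "openblas"
]
for info in openblas_libs:
    print(info["filepath"], info["version"], info["threading_layer"])

if len(openblas_libs) > 1:
    print("more than one libopenblas is loaded in this process")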

Context: why do 2 libopenblas copies get loaded

The reason is that the NumPy and SciPy wheels both vendor a copy of libopenblas, and the extension modules that need libopenblas depend directly on that vendored copy:

% cd /path/to/site-packages/scipy/linalg
% otool -L _fblas.cpython-39-darwin.so 
_fblas.cpython-39-darwin.so:
	@loader_path/../.dylibs/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	@loader_path/../.dylibs/libgfortran.5.dylib (compatibility version 6.0.0, current version 6.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)

% cd ../../numpy/linalg
% otool -L _umath_linalg.cpython-39-darwin.so 
_umath_linalg.cpython-39-darwin.so:
	@loader_path/../.dylibs/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)

This is how we have been shipping wheels for years, and it works fine across Windows, Linux and macOS. It seems like a weird thing to do, of course, if you know how package managers work but are new to PyPI/wheels. It's a long story, but the tl;dr is that PyPI wasn't designed with non-Python dependencies in mind, so the usual approach is to bundle those into the wheel (it tends to work, unless you have complex non-Python dependencies). Any kind of unbundling here would be very much nontrivial, and it would break situations where numpy and scipy are not installed the same way (e.g., the former from conda-forge/Homebrew, the latter from PyPI).
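
To see those vendored copies on disk, here is a small sketch (assuming a macOS wheel install; the .dylibs directory is created by delocate when the wheels are built and won't exist for other install methods):

import pathlib

import numpy
import scipy

# List the dylibs vendored inside the installed numpy and scipy packages.
for pkg in (numpy, scipy):
    dylibs_dir = pathlib.Path(pkg.__file__).parent / ".dylibs"
    print(pkg.__name__, "->", dylibs_dir)
    if dylibs_dir.is_dir():
        for lib in sorted(dylibs_dir.iterdir()):
            print("   ", lib.name)
    else:
        print("    (no vendored dylibs - probably not a macOS wheel install)")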

Possible root causes

The kernel panic apparently had to do with spin locks. It is not clear whether the performance issues are also due to that, or have a completely different root cause. It does seem to be the case that two copies of the same shared library with the same version (all are libopenblas.0.dylib) somehow cause a conflict at the OS level. Anything beyond that is speculation at this point.

Can we work around the problem?

If we release wheels for macOS 12, many people are going to hit this problem. A 50x slowdown for some code using linalg functionality, under the default install configuration of pip install numpy scipy, does not seem acceptable: it will send too many users on wild goose chases. On the other hand, it should be pointed out that users who build SciPy 1.7.2 from source on a native arm64 Python install will hit the same problem anyway. So not releasing any wheels isn't much better; at best it signals to users that they shouldn't use arm64 just yet and should stick with x86_64 (which has performance implications of its own).

At this point it looks like controlling the number of threads that OpenBLAS uses is the way we can work around this problem (or let users do so). Ways to control threading:

  • Use threadpoolctl (see the README at https://github.com/joblib/threadpoolctl for how)
  • Set an environment variable to control the behavior, e.g. OPENBLAS_NUM_THREADS
  • Rebuild the libopenblas we bundle in the wheel to have a max number of threads of 1, 2, or 4.

SciPy doesn’t have a threadpoolctl runtime dependency, and it doesn’t seem desirable to add one just for this issue. Note though that gh-14441 aims to add it as an optional dependency to improve test suite parallelism, and longer term we perhaps do want that dependency. Also, scikit-learn has a hard dependency on it, so many users will already have it installed.
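
For reference, a minimal sketch of the threadpoolctl option as a user-level workaround (not something SciPy would ship itself); threadpool_limits restricts all loaded BLAS libraries, including both libopenblas copies, for the duration of the block:

from threadpoolctl import threadpool_limits
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
K = X @ X.T

# Limit every loaded BLAS library to a single thread while computing.
with threadpool_limits(limits=1, user_api="blas"):
    s, _ = eigsh(K, 3, which="LA", tol=0)
print(s)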

Rebuilding libopenblas with a low maximum number of threads does not allow users who know what they are doing, or who don't suffer from the problem, to optimize threading behavior for their own code. It was pointed out in https://github.com/scipy/scipy/issues/14688#issuecomment-969143657 that this is undesirable.

Setting an environment variable is also not a great thing to do (a library should normally never do this), but if it works to do so in scipy/__init__.py then that may be the most pragmatic solution right now. However, it must be done before libopenblas is first loaded or it won't take effect. So if users import numpy first, setting the env var in scipy/__init__.py will have no effect on that already-loaded copy of libopenblas; it needs testing whether this still works around the problem.
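
A minimal sketch of the environment-variable approach from user code rather than from scipy/__init__.py (the ordering caveat above applies: it only helps if it runs before the first import that loads libopenblas):

import os

# Must run before numpy/scipy are imported anywhere in the process;
# otherwise the already-loaded libopenblas copies keep their defaults.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")

import numpy as np                      # NumPy's libopenblas is loaded here
from scipy.sparse.linalg import eigsh   # and SciPy's copy here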

Note: I wanted to have everything in one place, but let’s discuss the release strategy on the mailing list (link to thread), and the actual performance issue here.

Testing on other macOS arm64 build/install configurations

Request: if you have a build config on macOS arm64 that is not covered by the above summary yet, please run the following and reply on this issue with the results:

% python -m threadpoolctl -i scipy.linalg

% cd /PATH/TO/scipy/linalg
% otool -L _fblas.cpython-*-darwin.so

% cd /PATH/TO/numpy/linalg
% otool -L _umath_linalg.cpython-*-darwin.so

% # Run the reproducer (again, only on macOS 12 - you will trigger an OS
% # crash on macOS 11.x!) and report if the time per `eigsh` call is ~0.02 sec. or ~1 sec.

% pip list    # if using pip for everything
% conda list  # if using conda

Top GitHub Comments

1 reaction
ogrisel commented, Nov 29, 2021

Confirmed!

1 reaction
psobolewskiPhD commented, Nov 29, 2021

Hi again, the new wheels for 1.7.3 macOS arm64 have been uploaded. It would be great to get confirmation that things are looking better performance-wise with these new binaries, which should now be automatically preferred by pip.

I can confirm that using the new wheel (scipy-1.7.3-1-cp39-cp39-macosx_12_0_arm64.whl, 27.0 MB) in a fresh conda env on M1 macOS 12 fixes the performance issue:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.088 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.014 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.015 s
