Parallelism - what do libraries offer, and is there an API aspect to it
Several people have expressed a strong interest in talking about and working on (auto-)parallelization. Here is an attempt at summarizing this topic.
- Current status
- Auto-parallelization and nested parallelism
- Limitations due to Python package distribution mechanisms
- The need for a better API pattern or library
Current status
Linear algebra libraries
The main accelerated linear algebra libraries in use for CPU-based code are OpenBLAS and MKL. Both libraries auto-parallelize function calls.
OpenBLAS can be built with either its own pthreads-based thread pool or with OpenMP support. The number of threads can be controlled with an environment variable (`OPENBLAS_NUM_THREADS` or `OMP_NUM_THREADS`), or from Python via threadpoolctl. The conda-forge OpenBLAS package uses OpenMP; the OpenBLAS builds linked into NumPy and SciPy wheels on PyPI use pthreads.
MKL supports OpenMP and Intel TBB as its threading control mechanisms. The number of threads can be controlled with an environment variable (`MKL_NUM_THREADS` or `OMP_NUM_THREADS`), or from Python with threadpoolctl.
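As a minimal sketch of the Python-level control (assuming threadpoolctl and a BLAS-backed NumPy are installed), one can inspect the thread pools loaded into the process and limit them for a block of code:

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# Show which BLAS/OpenMP runtimes are loaded and how many threads each uses.
for pool in threadpool_info():
    print(pool["user_api"], pool["num_threads"], pool["filepath"])

a = np.random.rand(2000, 2000)

# Temporarily force the BLAS (OpenBLAS or MKL) to run single-threaded.
with threadpool_limits(limits=1, user_api="blas"):
    a @ a
```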
NumPy
NumPy does not provide parallelization, with the exception of the linear algebra routines, which inherit the auto-parallelization of the underlying library (typically OpenBLAS or MKL). NumPy does, however, consistently release the GIL where it can.
Scikit-learn
Scikit-learn provides an `n_jobs=1` keyword in many estimators and other functions to let users enable parallel execution. This is done via the joblib library, which provides both multiprocessing (the default) and threading backends that can be selected with a context manager.
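For example (a hedged sketch; the estimator and dataset are placeholders), the backend can be switched per code block:

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20)

# n_jobs=-1 requests all cores; the context manager picks the joblib backend.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
with parallel_backend("threading"):
    clf.fit(X, y)
```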
Scikit-learn also contains C and Cython code that uses OpenMP. OpenMP is enabled both in the wheels on PyPI and in the conda-forge packages. The number of threads used can be controlled with the `OMP_NUM_THREADS` environment variable.
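Note that OpenMP reads this variable when its runtime initializes, so as a sketch (assuming nothing in the process has loaded an OpenMP runtime yet), it has to be set before the relevant import:

```python
import os

# Must be set before the OpenMP runtime is first loaded in this process.
os.environ["OMP_NUM_THREADS"] = "4"

import sklearn.ensemble  # OpenMP-backed estimators now see the limit
```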
Scikit-learn has good documentation on parallelism and resource management.
SciPy
SciPy provides a `workers=1` keyword in a (still limited) number of functions to let users enable parallel execution. It is similar to scikit-learn’s `n_jobs` keyword, except that it also accepts a map-like callable (e.g. `multiprocessing.Pool.map`) to allow using a custom pool. C++ code in SciPy uses pthreads; the use of OpenMP was discussed and rejected.
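An illustrative sketch of both forms of `workers` (the objective function and bounds are placeholders):

```python
from multiprocessing import Pool

import numpy as np
from scipy.optimize import differential_evolution


def objective(x):
    return float(np.sum(x ** 2))


if __name__ == "__main__":  # guard needed for process-based pools
    bounds = [(-5, 5)] * 4

    # Integer: SciPy manages a process pool internally.
    res = differential_evolution(objective, bounds, workers=2, updating="deferred")

    # Map-like callable: supply a custom pool instead.
    with Pool(2) as pool:
        res = differential_evolution(objective, bounds,
                                     workers=pool.map, updating="deferred")
```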
`scipy.linalg` also provides a Cython API for BLAS and LAPACK. This lets other libraries use linear algebra routines without having to ship or build against an accelerated linear algebra library directly. Scikit-learn, statsmodels and other libraries do this, thereby again inheriting the auto-parallelization behavior from OpenBLAS or MKL.
Deep learning frameworks
TensorFlow, PyTorch, MXNet and JAX all have auto-parallelization behavior. Furthermore, they provide support for distributed computing (with the exception of JAX). These frameworks are very performance-focused and aim to make optimal use of all available hardware. They typically allow building with different backends like NCCL or Gloo for GPU support, and use OpenMP, MPI, gRPC and more.
The advantage these frameworks have is that users typically only use this one framework for their whole program, so the parallelism used can be optimized without having to play well with other Python packages that also execute code in parallel.
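As one concrete sketch (PyTorch’s thread-control calls; the counts are arbitrary), such frameworks expose process-wide knobs precisely because they assume they own the whole process:

```python
import torch

# Intra-op parallelism: threads used within a single operator.
torch.set_num_threads(4)

# Inter-op parallelism: threads used to run independent operators
# concurrently; must be set before any inter-op work starts.
torch.set_num_interop_threads(2)
```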
Dask
Dask provides parallel arrays, dataframes and machine learning algorithms with APIs that match NumPy, Pandas and scikit-learn as much as possible. Dask is a pure Python library and uses blocked algorithms; each block contains a single NumPy array or Pandas dataframe. Scaling to hundreds of nodes is possible; Dask is a good solution for obtaining distributed arrays. When used to obtain parallelism on a single node, however, it is not very efficient.
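A minimal sketch of the blocked-array model (the array and chunk sizes are arbitrary):

```python
import dask.array as da

# A 10000x10000 array split into 1000x1000 NumPy blocks.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Operations build a task graph; compute() executes it in parallel.
result = (x @ x.T).mean().compute()
```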
Auto-parallelization and nested parallelism
Some libraries, like the deep learning frameworks, do auto-parallelization. Most non-deep-learning libraries do not. When a single library or framework is used to execute an end user program, auto-parallelization is usually a good thing to have: it uses all available hardware resources in an optimal fashion.
Problems can occur when multiple libraries are involved. What often happens is oversubscription of resources. For example, if an end user writes code using scikit-learn with `n_jobs=-1`, and NumPy auto-parallelizes operations, then scikit-learn will use N processes (on an N-core machine) and NumPy will use N threads per process, leading to N^2 threads being used. On machines with a large number of cores, the overhead of this quickly becomes problematic. Given that NumPy uses OpenBLAS or MKL, this problem already occurs today. For a while Anaconda and Intel shipped a modified NumPy version that had auto-parallelization behavior for functions other than linear algebra, and the problem occurred more frequently.
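A hedged sketch of one common mitigation (the worker function and array sizes are placeholders): limit the BLAS to one thread inside each worker, so that process-level and thread-level parallelism compose. Recent joblib/loky versions attempt a similar limit automatically in their worker processes.

```python
import numpy as np
from joblib import Parallel, delayed
from threadpoolctl import threadpool_limits


def worker(seed):
    rng = np.random.default_rng(seed)
    a = rng.random((500, 500))
    # Keep BLAS single-threaded inside each worker to avoid N^2 threads.
    with threadpool_limits(limits=1, user_api="blas"):
        return np.linalg.eigvals(a).real.max()


results = Parallel(n_jobs=-1)(delayed(worker)(s) for s in range(8))
```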
The paper Composable Multi-Threading and Multi-Processing for Numeric Libraries by Malakhov et al. contains a good overview with examples and comparisons between different parallelization methods. It uses NumPy, SciPy, Dask, and Numba, and covers `multiprocessing`, `concurrent.futures`, OpenMP, Intel TBB (Threading Building Blocks), and a custom library SMP (symmetric multi-processing).
Limitations due to Python package distribution mechanisms
When one wants to use auto-parallelization, it’s important to have control over the complete set of packages that ends up installed on a user’s machine. That way one can ensure there’s a single linear algebra library installed and a single OpenMP runtime in use.
That control over the full set of packages is common in HPC-type situations, where admins deal with build and install requirements to make libraries work well together. Both package managers (e.g. Apt in Debian) and Conda have the ability to do this right as well, both because of dependency resolution and because of a common build infrastructure.
A large fraction of Python users install packages from PyPI with pip, however. The binary installers (wheels) on PyPI are not built on a common infrastructure, and because there’s no real support for non-Python dependencies, libraries like OpenMP and OpenBLAS are bundled into the wheels and installed into end user environments multiple times. This makes it very difficult to reliably use, e.g., OpenMP. For this reason SciPy uses custom pthreads thread pools rather than OpenMP.
The need for a better API pattern or library
Given the status of the ecosystem today, the default behavior for libraries like NumPy and SciPy should be single-threaded execution, because anything else composes badly with multiprocessing, scikit-learn (joblib), Dask, etc. However, there’s room for improvement here. Two things that could help improve the coordination of parallelization behavior in a stack of Python libraries are:
- A common API pattern for enabling parallelism
- A common library providing a parallelization layer
A common API pattern is the simpler of the two options. It could be a keyword like `n_jobs` or `workers` that gets used consistently between libraries, or a context manager to achieve the same level of per-function or per-code-block control.
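A purely hypothetical sketch of what the context manager variant could look like (nothing here is an existing API; `parallelism` and `workers` are illustrative names):

```python
import contextlib
import contextvars

# Context variable holding the default worker count for the current block.
_default_workers = contextvars.ContextVar("default_workers", default=1)


@contextlib.contextmanager
def parallelism(workers):
    """Hypothetical context manager that cooperating libraries consult."""
    token = _default_workers.set(workers)
    try:
        yield
    finally:
        _default_workers.reset(token)


def some_library_function(data, workers=None):
    # Per-call keyword wins; otherwise fall back to the context default.
    n_workers = workers if workers is not None else _default_workers.get()
    ...


# Every cooperating library call in this block would default to 8 workers:
# with parallelism(workers=8):
#     some_library_function(data)
```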
A common library would be more powerful and enable auto-parallelization rather than giving the user control (which is what the API pattern does). From a performance perspective, having arrays and dataframes auto-parallelize their functions as much as possible over all cores on a single node, and then letting a separate library like Dask deal with multi-node coordination, seems optimal. Introducing a new dependency into multiple libraries at the core of the PyData ecosystem is a nontrivial exercise however.
The above attempts to summarize the state of affairs today. The topic of parallelization is largely an implementation rather than an API question; however, there is an API component to it with option (1) above. How to move forward here is worth discussing.
Note: there’s also a lot of room left in NumPy for optimizing single-threaded performance. There’s ongoing work on making better use of SIMD intrinsics (a large, ongoing effort), and using SLEEF for vector math has been discussed in the past, but no one is working on it.
Top GitHub Comments
@aregm posted a nice writeup / summary on threading APIs in another thread, and it feels very relevant to this discussion of parallelism as well:
Omni Parallel Runtime_New.pdf
sklearn now actually uses threadpoolctl internally to make some computations parallel by default, such as in `HistGradientBoostingClassifier`, and makes sure others are not parallel by setting jobs to 1. There are some issues with nesting, and there are issues with finding the right number of threads. Right now we use the number of (virtual) cores, which often seems to be a bad idea; the physical cores might be better. I don’t think we have an entirely consistent story about the interactions between `n_jobs` and our use of OpenMP. So in conclusion: just in scikit-learn, this is already a mess, ‘only’ dealing with 4 types of parallelism (n_jobs processes, n_jobs threads, OpenMP and BLAS). We could have our own ‘library’ solution, but I don’t think any one of us has the expertise to do this; it’s probably pretty hard to actually know how to allocate cores across different ML algorithms. I’m not sure where to even start on that.
I’m not sure I understand proposal 2: is that a Python library? How would that integrate with the C and Fortran code? If it’s a C library, how does it integrate with Numba?