Parallelism - what do libraries offer, and is there an API aspect to it
Several people have expressed a strong interest in talking about and working on (auto-)parallelization. Here is an attempt at summarizing this topic.
- Current status
- Auto-parallelization and nested parallelism
- Limitations due to Python package distribution mechanisms
- The need for a better API pattern or library
Current status
Linear algebra libraries
The main accelerated linear algebra libraries in use for CPU-based code are OpenBLAS and MKL. Both libraries auto-parallelize function calls.
OpenBLAS can be built with either its own pthreads-based thread pool or with OpenMP support. The number of threads can be controlled with an environment variable (`OPENBLAS_NUM_THREADS` or `OMP_NUM_THREADS`), or from Python via threadpoolctl. The conda-forge OpenBLAS package uses OpenMP; the OpenBLAS builds linked into NumPy and SciPy wheels on PyPI use pthreads.
MKL supports OpenMP and Intel TBB as its threading control mechanisms. The number of threads can be controlled with an environment variable (`MKL_NUM_THREADS` or `OMP_NUM_THREADS`), or from Python with threadpoolctl.
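As a minimal sketch of the Python-level control (assuming threadpoolctl and a BLAS-backed NumPy are installed), one can inspect the thread pools loaded into the process and limit them for a block of code:

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# Show which BLAS/OpenMP runtimes are loaded and how many threads each uses.
for pool in threadpool_info():
    print(pool["user_api"], pool["num_threads"], pool["filepath"])

a = np.random.rand(2000, 2000)

# Temporarily force the BLAS (OpenBLAS or MKL) to run single-threaded.
with threadpool_limits(limits=1, user_api="blas"):
    a @ a
```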
NumPy
NumPy does not provide parallelization, with the exception of the linear algebra routines, which inherit the auto-parallelization of the underlying library (typically OpenBLAS or MKL). NumPy does, however, consistently release the GIL where it can.
Scikit-learn
Scikit-learn provides an `n_jobs=1` keyword in many estimators and other functions to let users enable parallel execution. This is done via the joblib library, which provides both multiprocessing (the default) and threading backends that can be selected with a context manager.
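For example (a hedged sketch; the estimator and dataset are placeholders), the backend can be switched per code block:

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20)

# n_jobs=-1 requests all cores; the context manager picks the joblib backend.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
with parallel_backend("threading"):
    clf.fit(X, y)
```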
Scikit-learn also contains C and Cython code that uses OpenMP. OpenMP is enabled both in the wheels on PyPI and in the conda-forge packages. The number of threads used can be controlled with the `OMP_NUM_THREADS` environment variable.
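Note that OpenMP reads this variable when its runtime initializes, so as a sketch (assuming nothing in the process has loaded an OpenMP runtime yet), it has to be set before the relevant import:

```python
import os

# Must be set before the OpenMP runtime is first loaded in this process.
os.environ["OMP_NUM_THREADS"] = "4"

import sklearn.ensemble  # OpenMP-backed estimators now see the limit
```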
Scikit-learn has good documentation on parallelism and resource management.
SciPy
SciPy provides a `workers=1` keyword in a (still limited) number of functions to let users enable parallel execution. It is similar to scikit-learn’s `n_jobs` keyword, except that it also accepts a map-like callable (e.g. `multiprocessing.Pool.map`) to allow using a custom pool. C++ code in SciPy uses pthreads; the use of OpenMP was discussed and rejected.
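An illustrative sketch of both forms of `workers` (the objective function and bounds are placeholders):

```python
from multiprocessing import Pool

import numpy as np
from scipy.optimize import differential_evolution


def objective(x):
    return float(np.sum(x ** 2))


if __name__ == "__main__":  # guard needed for process-based pools
    bounds = [(-5, 5)] * 4

    # Integer: SciPy manages a process pool internally.
    res = differential_evolution(objective, bounds, workers=2, updating="deferred")

    # Map-like callable: supply a custom pool instead.
    with Pool(2) as pool:
        res = differential_evolution(objective, bounds,
                                     workers=pool.map, updating="deferred")
```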
`scipy.linalg` also provides a Cython API for BLAS and LAPACK. This lets other libraries use linear algebra routines without having to ship or build against an accelerated linear algebra library directly. Scikit-learn, statsmodels and other libraries do this, thereby again inheriting the auto-parallelization behavior from OpenBLAS or MKL.
Deep learning frameworks
TensorFlow, PyTorch, MXNet and JAX all have auto-parallelization behavior. Furthermore, they provide support for distributed computing (with the exception of JAX). These frameworks are very performance-focused and aim to make optimal use of all available hardware. They typically allow building with different backends like NCCL or Gloo for GPU support, and use OpenMP, MPI, gRPC and more.
The advantage these frameworks have is that users typically only use this one framework for their whole program, so the parallelism used can be optimized without having to play well with other Python packages that also execute code in parallel.
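As one concrete sketch (PyTorch’s thread-control calls; the counts are arbitrary), such frameworks expose process-wide knobs precisely because they assume they own the whole process:

```python
import torch

# Intra-op parallelism: threads used within a single operator.
torch.set_num_threads(4)

# Inter-op parallelism: threads used to run independent operators
# concurrently; must be set before any inter-op work starts.
torch.set_num_interop_threads(2)
```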
Dask
Dask provides parallel arrays, dataframes and machine learning algorithms with APIs that match NumPy, Pandas and scikit-learn as much as possible. Dask is a pure Python library and uses blocked algorithms; each block contains a single NumPy array or Pandas dataframe. Scaling to hundreds of nodes is possible; Dask is a good solution for obtaining distributed arrays. When used to obtain parallelism on a single node, however, it is not very efficient.
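A minimal sketch of the blocked-array model (the array and chunk sizes are arbitrary):

```python
import dask.array as da

# A 10000x10000 array split into 1000x1000 NumPy blocks.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Operations build a task graph; compute() executes it in parallel.
result = (x @ x.T).mean().compute()
```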
Auto-parallelization and nested parallelism
Some libraries, like the deep learning frameworks, do auto-parallelization. Most non-deep-learning libraries do not. When a single library or framework is used to execute an end user program, auto-parallelization is usually a good thing to have: it uses all available hardware resources in an optimal fashion.
Problems can occur when multiple libraries are involved. What often happens is oversubscription of resources. For example, if an end user writes code using scikit-learn with `n_jobs=-1`, and NumPy auto-parallelizes operations, then scikit-learn will use N processes (on an N-core machine) and NumPy will use N threads per process, leading to N^2 threads being used. On machines with a large number of cores, the overhead of this quickly becomes problematic. Given that NumPy uses OpenBLAS or MKL, this problem already occurs today. For a while Anaconda and Intel shipped a modified NumPy version that had auto-parallelization behavior for functions other than linear algebra, and the problem occurred more frequently.
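A hedged sketch of one common mitigation (the worker function and array sizes are placeholders): limit the BLAS to one thread inside each worker, so that process-level and thread-level parallelism compose. Recent joblib/loky versions attempt a similar limit automatically in their worker processes.

```python
import numpy as np
from joblib import Parallel, delayed
from threadpoolctl import threadpool_limits


def worker(seed):
    rng = np.random.default_rng(seed)
    a = rng.random((500, 500))
    # Keep BLAS single-threaded inside each worker to avoid N^2 threads.
    with threadpool_limits(limits=1, user_api="blas"):
        return np.linalg.eigvals(a).real.max()


results = Parallel(n_jobs=-1)(delayed(worker)(s) for s in range(8))
```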
The paper Composable Multi-Threading and Multi-Processing for Numeric Libraries by Malakhov et al. contains a good overview with examples and comparisons between different parallelization methods. It uses NumPy, SciPy, Dask, and Numba, and covers `multiprocessing`, `concurrent.futures`, OpenMP, Intel TBB (Threading Building Blocks), and a custom library SMP (symmetric multi-processing).
Limitations due to Python package distribution mechanisms
When one wants to use auto-parallelization, it’s important to have control over the complete set of packages that ends up installed on a user’s machine. That way one can ensure there’s a single linear algebra library installed and a single OpenMP runtime in use.
That control over the full set of packages is common in HPC-type situations, where admins deal with build and install requirements to make libraries work well together. Both package managers (e.g. Apt in Debian) and Conda have the ability to do this right as well, both because of dependency resolution and because of a common build infrastructure.
A large fraction of Python users install packages from PyPI with pip, however. The binary installers (wheels) on PyPI are not built on a common infrastructure, and because there’s no real support for non-Python dependencies, libraries like OpenMP and OpenBLAS are bundled into the wheels and installed into end user environments multiple times. This makes it very difficult to reliably use, e.g., OpenMP. For this reason SciPy uses custom pthreads thread pools rather than OpenMP.
The need for a better API pattern or library
Given the status of the ecosystem today, the default behavior for libraries like NumPy and SciPy should be single-threaded execution, because anything else composes badly with multiprocessing, scikit-learn (joblib), Dask, etc. However, there’s room for improvement here. Two things that could help improve the coordination of parallelization behavior in a stack of Python libraries are:
- A common API pattern for enabling parallelism
- A common library providing a parallelization layer
A common API pattern is the simpler of the two options. It could be a keyword like `n_jobs` or `workers` that gets used consistently between libraries, or a context manager to achieve the same level of per-function or per-code-block control.
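A purely hypothetical sketch of what the context manager variant could look like (nothing here is an existing API; `parallelism` and `workers` are illustrative names):

```python
import contextlib
import contextvars

# Context variable holding the default worker count for the current block.
_default_workers = contextvars.ContextVar("default_workers", default=1)


@contextlib.contextmanager
def parallelism(workers):
    """Hypothetical context manager that cooperating libraries consult."""
    token = _default_workers.set(workers)
    try:
        yield
    finally:
        _default_workers.reset(token)


def some_library_function(data, workers=None):
    # Per-call keyword wins; otherwise fall back to the context default.
    n_workers = workers if workers is not None else _default_workers.get()
    ...


# Every cooperating library call in this block would default to 8 workers:
# with parallelism(workers=8):
#     some_library_function(data)
```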
A common library would be more powerful and enable auto-parallelization rather than giving the user control (which is what the API pattern does). From a performance perspective, having arrays and dataframes auto-parallelize their functions as much as possible over all cores on a single node, and then letting a separate library like Dask deal with multi-node coordination, seems optimal. Introducing a new dependency into multiple libraries at the core of the PyData ecosystem is a nontrivial exercise however.
The above attempts to summarize the state of affairs today. The topic of parallelization is largely an implementation rather than an API question; however, there is an API component to it with option (1) above. How to move forward here is worth discussing.
Note: there’s also a lot of room left in NumPy for optimizing single-threaded performance. There’s ongoing work on making better use of SIMD intrinsics (a large, ongoing effort), and using SLEEF for vector math has been discussed in the past, but no one is working on it.
Top GitHub Comments
@aregm posted a nice writeup / summary on threading APIs in another thread, and it feels very relevant to this discussion of parallelism as well:
Omni Parallel Runtime_New.pdf
sklearn now actually uses threadpoolctl internally to make some computations parallel by default, such as in `HistGradientBoostingClassifier`, and makes sure others are not parallel by setting jobs to 1. There are some issues with nesting, and there are issues with finding the right number of threads. Right now we use the number of (virtual) cores, which often seems to be a bad idea; the physical cores might be better. I don’t think we have an entirely consistent story about the interactions between `n_jobs` and our use of OpenMP. So in conclusion: just in scikit-learn, this is already a mess, ‘only’ dealing with 4 types of parallelism (n_jobs processes, n_jobs threads, OpenMP and BLAS). We could have our own ‘library’ solution, but I don’t think any one of us has the expertise to do this; it’s probably pretty hard to actually know how to allocate cores across different ML algorithms. I’m not sure where to even start on that.
I’m not sure I understand proposal 2: is that a Python library? How would that integrate with the C and Fortran code? If it’s a C library, how does it integrate with Numba?