
DOC: Implicit parallelism and how to disable it


Questions:

Apparently OpenMP (or BLAS?) is used in pandas internals, or in one of its dependencies, at least in some cases. For example, value_counts() uses all CPU cores on my machine, but does not actually benefit from them.

  1. Is this expected behavior?
  2. Is there any documentation on which functions are parallelized, and how to disable it if desired?

Details:

Here’s a test program that repeatedly calls value_counts() on a large array:

# test.py
import numpy as np
import pandas as pd

a = np.random.randint(10, size=20_000_000)
s = pd.Series(a)

for _ in range(50):
    c = s.value_counts(sort=True)

From monitoring htop I discovered that value_counts() is using multiple CPU cores.

Apparently the parallelism can be controlled via OMP_NUM_THREADS, as shown below by the difference between real and user time when the number of threads is changed.

$ time OMP_NUM_THREADS=1 python test.py

real	0m8.714s
user	0m6.715s
sys	0m1.653s

$ time OMP_NUM_THREADS=10 python test.py

real	0m7.921s
user	0m58.198s
sys	0m7.345s
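
Since I'm not sure whether OpenMP or the BLAS library is actually responsible, I also tried pinning every thread-pool backend I could think of from inside Python, before importing numpy/pandas (most of these libraries only read their environment variable when they first load). This is just a sketch of what I mean; which variable actually takes effect presumably depends on which backend the installed build is linked against:

# limit_threads.py  (sketch -- which variable matters depends on the backend)
import os

# Set these *before* importing numpy/pandas, since the thread pools are
# typically sized when the native libraries are first initialized.
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP
os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS

import numpy as np
import pandas as pd

a = np.random.randint(10, size=20_000_000)
s = pd.Series(a)
c = s.value_counts(sort=True)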

The effect of OMP_NUM_THREADS was a surprise to me. I know that some operations in np.linalg are implicitly parallelized via the BLAS implementation, but I would not have guessed that value_counts() uses any of those functions. (Am I mistaken?)
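
To check whether BLAS is even involved here, one thing I tried is inspecting which native libraries numpy is linked against and which thread pools are actually loaded in the running process. Rough sketch below; it assumes the third-party threadpoolctl package is installed, and I'm not sure it catches every possible backend:

# which_backend.py  (diagnostic sketch; threadpoolctl is a third-party
# package and has to be installed separately)
import numpy as np

# Print the BLAS/LAPACK libraries this numpy build was compiled against.
np.__config__.show()

# List the native thread pools (OpenMP, MKL, OpenBLAS, ...) currently
# loaded in the process, with their thread counts.
from threadpoolctl import threadpool_info
for pool in threadpool_info():
    print(pool)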

This behavior is especially undesirable when one is already parallelizing one’s code via multiprocessing (or, in my case, pyspark). In that case, the multiple threads in each process compete with each other, causing thrashing, etc.
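
For what it's worth, the workaround I'm experimenting with for the multiprocessing case is to cap each worker's native thread pools from a pool initializer, since with a fork-based pool the environment variables may already have been read by the time the worker starts. This is only a sketch, again assuming the third-party threadpoolctl package; init_worker and count_values are made-up names for illustration:

# workers_single_threaded.py  (sketch; assumes the third-party
# threadpoolctl package; the function names are made up)
import multiprocessing as mp
import numpy as np
import pandas as pd
from threadpoolctl import threadpool_limits

_limit = None  # keep a reference so the limit stays active for the worker's lifetime

def init_worker():
    global _limit
    # Cap OpenMP/BLAS thread pools to a single thread in this worker.
    _limit = threadpool_limits(limits=1)

def count_values(seed):
    a = np.random.RandomState(seed).randint(10, size=2_000_000)
    return pd.Series(a).value_counts(sort=True)

if __name__ == "__main__":
    with mp.Pool(processes=4, initializer=init_worker) as pool:
        results = pool.map(count_values, range(8))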

Is setting the OMP_NUM_THREADS variable the recommended way to disable this behavior? Thanks in advance for any tips!

(Side note: As you can see, adding more threads apparently does not provide much benefit even in this simple case – it just eats more CPU time without saving wall time.)
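
If the environment variable turns out not to be the recommended lever, another option I've come across is limiting the thread pools only around the expensive call, so the rest of the program is unaffected. Again just a sketch, and it assumes the third-party threadpoolctl package:

# scoped_limit.py  (sketch; assumes the third-party threadpoolctl package)
import numpy as np
import pandas as pd
from threadpoolctl import threadpool_limits

a = np.random.randint(10, size=20_000_000)
s = pd.Series(a)

with threadpool_limits(limits=1):
    # OpenMP/BLAS thread pools are capped at 1 thread inside this block only.
    c = s.value_counts(sort=True)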


Appendix

This issue can be reproduced by simply installing the latest version of pandas with conda, but FWIW, here are the versions I’m using:

$ conda list | grep -E '(^python )|(^numpy)|(^pandas)|(blas)'
blas                      1.0                         mkl
numpy                     1.13.3           py36hdbf6ddf_4
openblas                  0.2.20                        7    conda-forge
pandas                    0.23.4           py36hf8a1672_0    conda-forge
python                    3.6.3                h1284df2_4

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
jreback commented, Oct 19, 2018

A section in the FAQ, I think, would be good.

0 reactions
stuarteberg commented, Oct 19, 2018

> the best we can do is put a strong warning in the documentation to encourage users to use OMP_NUM_THREADS=1 before they run their code.

I’ll write something up. Which section(s) of the docs should I edit? I think the following sections are candidates:
