DOC: Implicit parallelism and how to disable it
Questions:
Apparently OpenMP (or BLAS?) is used in pandas internals (or one of its dependencies), at least in some cases. For example, value_counts() uses all CPU cores on my machine (but does not benefit from it).
- Is this expected behavior?
- Is there any documentation on which functions are parallelized, and how to disable it if desired?
Details:
Here’s a test program that repeatedly calls value_counts() on a large array:
# test.py
import numpy as np
import pandas as pd
a = np.random.randint(10, size=20_000_000)
s = pd.Series(a)
for _ in range(50):
    c = s.value_counts(sort=True)
From monitoring htop, I discovered that value_counts() is using multiple CPU cores.
Apparently the parallelism can be controlled via OMP_NUM_THREADS, as shown here in the difference between real and user time when the number of threads is changed.
$ time OMP_NUM_THREADS=1 python test.py
real 0m8.714s
user 0m6.715s
sys 0m1.653s
$ time OMP_NUM_THREADS=10 python test.py
real 0m7.921s
user 0m58.198s
sys 0m7.345s
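The same comparison can be scripted from Python. The following is a minimal sketch, not part of the original report: the helper name time_child and the stand-in workload are hypothetical, and the child CPU figures from os.times() are only meaningful on Unix-like systems. It runs a command with OMP_NUM_THREADS set and reports wall-clock time next to the child's combined user and system CPU time.

```python
import os
import subprocess
import sys
import time


def time_child(cmd, num_threads):
    """Run cmd with OMP_NUM_THREADS set; return (wall, cpu) in seconds.

    cpu is the child's user + system time, comparable to the
    "user" and "sys" lines printed by the shell's time command.
    """
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)  # blocks until the child exits
    wall = time.perf_counter() - start
    after = os.times()
    cpu = ((after.children_user - before.children_user)
           + (after.children_system - before.children_system))
    return wall, cpu


# Stand-in workload (hypothetical). To repeat the measurement from the
# report, use: cmd = [sys.executable, "test.py"]
cmd = [sys.executable, "-c", "sum(range(10**6))"]

for n in (1, 10):
    wall, cpu = time_child(cmd, n)
    print(f"OMP_NUM_THREADS={n}: wall={wall:.2f}s cpu={cpu:.2f}s")
```

CPU time far above wall time, as in the 10-thread run above, indicates that several threads were busy at once.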
This was a surprise to me. I know that some operations in np.linalg are implicitly parallelized via the BLAS implementation, but I would not have guessed that value_counts() used any of those functions. (Am I mistaken?)
This behavior is especially undesirable when one is already parallelizing one’s code via multiprocessing (or, in my case, pyspark). In that case, the multiple threads in each process compete with each other, causing thrashing, etc.
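To make the contention concrete: with P worker processes that each start T native threads, up to P × T threads compete for the machine's cores. The following is a rough budgeting sketch, assuming the aim is to keep that product at or below the core count. The function name threads_per_worker is hypothetical, and the heuristic is not something pandas or this issue prescribes.

```python
import os


def threads_per_worker(n_workers, total_cores=None):
    """Return a per-process thread budget so that
    n_workers * threads <= total_cores, with a minimum of 1 thread."""
    total = total_cores or os.cpu_count() or 1
    return max(1, total // n_workers)


# Example: 16 cores shared by 4 worker processes gives 4 threads each;
# with more workers than cores, each worker gets a single thread.
print(threads_per_worker(4, total_cores=16))
print(threads_per_worker(32, total_cores=16))
```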
Is setting the OMP_NUM_THREADS variable the recommended way to disable this behavior? Thanks in advance for any tips!
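For reference, the variable can also be set from inside the script rather than on the command line. The following is a minimal sketch under some assumptions: the helper name limit_native_threads is hypothetical, the variables must be set before numpy or pandas is first imported, and which variable actually takes effect depends on the threading backend in the installed build (OPENBLAS_NUM_THREADS and MKL_NUM_THREADS are included for OpenBLAS and MKL builds). This is one possible approach, not a confirmed recommendation from the maintainers.

```python
import os

# Thread-count variables read by common native backends. Which one applies
# depends on how numpy/pandas and their dependencies were built.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_native_threads(n=1):
    """Set the thread-count variables to n and return what was set.

    Call this before the first import of numpy or pandas; libraries that
    are already loaded may have read the old values.
    """
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n)
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


limit_native_threads(1)

# Import the numeric libraries only after the variables are set:
# import numpy as np
# import pandas as pd
```

Child processes started afterwards inherit these values, which matters when each multiprocessing worker would otherwise start its own thread pool.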
(Side note: As you can see, adding more threads apparently does not provide much benefit even in this simple case – it just eats more CPU time without saving wall time.)
Appendix
This issue can be reproduced by simply installing the latest version of pandas with conda, but FWIW, here are the versions I’m using:
$ conda list | grep -E '(^python )|(^numpy)|(^pandas)|(blas)'
blas 1.0 mkl
numpy 1.13.3 py36hdbf6ddf_4
openblas 0.2.20 7 conda-forge
pandas 0.23.4 py36hf8a1672_0 conda-forge
python 3.6.3 h1284df2_4
Issue Analytics
- State:
- Created 5 years ago
- Comments:13 (13 by maintainers)
Top GitHub Comments
I think a section in the FAQ would be good.
I’ll write something up. Which section(s) of the docs should I edit? I think the following sections are candidates: