
linalg.solve slower with more CPUs

See original GitHub issue

This issue was first reported here.

Reproducing code example:

Add this to solve_speed_test.py

import numpy as np

# Define out of function so we don't time it.
flen = 512
G = np.random.randn(flen, flen)
D = np.random.randn(flen)

def time_solve():
    # List comprehension is to get max CPU usage.
    [np.linalg.solve(G, D) for _ in range(10)]

Then, time it using all CPUs

python -m timeit -s "from solve_speed_test import time_solve" "time_solve()"

and with only one CPU

taskset --cpu-list 0 python -m timeit -s "from solve_speed_test import time_solve" "time_solve()"

These are the results I get on a 32 CPU machine:

mpariente@grcinq-29:tmp$ nproc --all
32
mpariente@grcinq-29:tmp$ taskset --cpu-list 0 python -m timeit -s "from solve_speed_test import time_solve" "time_solve()"                                              
10 loops, best of 3: 74 msec per loop
mpariente@grcinq-29:tmp$ python -m timeit -s "from solve_speed_test import time_solve" "time_solve()"                                                                   
10 loops, best of 3: 1.42 sec per loop

This was also replicated on three other machines with different specs.

Numpy/Python version information:

  • Python 3.6
  • Numpy 1.18.1

Maybe this is a known issue, and it probably depends a lot on the shape, but we'd like some help with this.
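
As a rough way to see how much this depends on the shape (a sketch, not part of the original report; the size grid is arbitrary), one can time solves over a range of matrix sizes and run the script once pinned to a single CPU with taskset and once unrestricted:

import timeit

import numpy as np

# Time np.linalg.solve across matrix sizes. Run once under
# "taskset --cpu-list 0" and once without, then compare.
for flen in (64, 128, 256, 512, 1024, 2048):
    G = np.random.randn(flen, flen)
    D = np.random.randn(flen)
    total = timeit.timeit(lambda: np.linalg.solve(G, D), number=10)
    print(f"flen={flen:5d}: {total / 10 * 1000:8.2f} ms per solve")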

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
jlucier commented, Feb 3, 2021

Answers / responses to the below statement are welcome, but certainly not expected. I’d just like to document my experience for anyone else unfortunate enough to stumble here, and maybe get some insight into what the hell is going on haha.


Long story very very short, a lot of my company’s signal processing seems to end up in these “BLAS” functions called by Numpy or Scipy when trying to solve matrices. We use scipy.signal.firwin and scipy.signal.filtfilt a lot.

Anyway, I was able to see a 24x speedup simply by setting OMP_NUM_THREADS=1. I suspect that OpenBLAS is just trying too hard to parallelize on the relatively small inputs we’re giving it? I don’t have the energy to dig into this deeply, but this seems like a pretty ridiculous experience to have had.
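
For reference, a minimal sketch of that workaround (assuming the bundled OpenBLAS honours OMP_NUM_THREADS; OpenBLAS also reads OPENBLAS_NUM_THREADS). The variables typically need to be set before NumPy is first imported, since OpenBLAS reads them when the library is loaded:

import os

# Cap BLAS threading before NumPy is imported; OpenBLAS reads these
# environment variables once, at load time.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

G = np.random.randn(512, 512)
D = np.random.randn(512)
x = np.linalg.solve(G, D)  # now runs single-threaded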

Evidently, this issue has lurked since my company upgraded to Numpy / Scipy versions which bundled OpenBLAS. Fortunately, our overall system’s performance was not dominated by how fast we can solve matrices, but one particular module clearly was heavily affected. We didn’t have this issue when compiling Numpy / Scipy against ATLAS or libblas and liblapack.

This leaves me with a question: why did the maintainers elect to bundle OpenBLAS with Numpy and Scipy? Is it really efficient for large inputs? Or other circumstances? I suppose I’m just blown away by how much worse it is in our use case. Am I just dumb, and missed a section of the doc on performance tuning?
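
As a side note (not from the comment above), a quick way to check which BLAS NumPy is actually linked against (bundled OpenBLAS, MKL, ATLAS, or a system BLAS) is:

import numpy as np

# Prints the BLAS/LAPACK libraries NumPy was built against.
np.show_config()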

1 reaction
mattip commented, Mar 17, 2020

You may be interested in the Python package threadpoolctl, which tries to provide a convenient API for such tasks. I am not sure that OMP_NUM_THREADS is the correct way to set threads for MKL.

The lack of speedup when going from --cpu-list 0 to --cpu-list 0,1 might indicate that you have not disabled hyperthreading, and so are effectively running with the same resources in both cases. It seems your sample size is best served on your machine with 4 (effectively 2) CPUs.
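
A minimal sketch of the threadpoolctl approach mentioned above, which caps the BLAS thread pool at runtime (unlike the environment variables, which typically have to be set before NumPy is imported):

import numpy as np
from threadpoolctl import threadpool_limits, threadpool_info

# Show which BLAS implementation is loaded and how many threads it uses.
print(threadpool_info())

G = np.random.randn(512, 512)
D = np.random.randn(512)

# Restrict only the BLAS thread pool for the duration of the block.
with threadpool_limits(limits=1, user_api="blas"):
    x = np.linalg.solve(G, D)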

Is it normal behaviour?

Unfortunately yes. Strategies inside software like MKL or OpenBLAS for splitting tasks between CPUs may not be optimal; they need to balance task size, memory bandwidth, CPU caching and resource availability. There are no set interfaces for querying the machine across vendors and operating systems, so libraries are forced to choose generic strategies that don’t always provide optimal results. Some people with high-end machines end up using their own heuristics and OS-level tools like taskset to partition the machine optimally. Tools like Intel’s VTune, AMD’s uProf or Concurrency Visualizer (part of Visual Studio) can help determine when resource contention is slowing down your task. The fact that such tools exist, and that people are willing to pay the license fees to use them, shows this is not a solved problem. Note that GPGPUs also suffer from these types of problems, and GPU vendors also supply tools to analyze where slowdowns occur.

Read more comments on GitHub >

Top Results From Across the Web

numpy.linalg.solve is 6x faster on my Mac than on my desktop ...
So my Mac runs numpy.linalg.solve a LOT faster than my desktop, despite the slower CPU. It's at LEAST 6 times faster and I'm...

Python multiprocessing gives slower speed as more cores are ...
The issue was as follows: When scipy.sparse diagonalises a matrix bigger than some threshold, then it automatically multithreads (which I ...

Could a server with 64 cores be 100x slower than my laptop?
In this post I cover a detective story how misconfigured BLAS slowed down scipy and numpy linear algebra operations (inverse matrix .inv) ...

Pure Python vs NumPy vs TensorFlow Performance Comparison
A performance comparison between pure Python, NumPy, and TensorFlow using a simple linear regression algorithm.

torch.linalg.lstsq — PyTorch 1.13 documentation
If None , 'gelsy' is used for CPU inputs and 'gels' for CUDA inputs. Default: None . Returns: A named tuple (solution, residuals,...
