linalg.solve slower with more CPUs
This issue was first reported here.
Reproducing code example:
Add this to solve_speed_test.py
import numpy as np
# Define out of function so we don't time it.
flen = 512
G = np.random.randn(flen, flen)
D = np.random.randn(flen)
def time_solve():
    # List comprehension is to get max CPU usage.
    [np.linalg.solve(G, D) for _ in range(10)]
Then, time it using all CPUs
python -m timeit -s "from solve_speed_test import time_solve" "time_solve()"
and with only one CPU
taskset --cpu-list 0 python -m timeit -s "from solve_speed_test import time_solve" "time_solve()"
These are the results I get on a 32 CPU machine:
mpariente@grcinq-29:tmp$ nproc --all
32
mpariente@grcinq-29:tmp$ taskset --cpu-list 0 python -m timeit -s "from solve_speed_test import time_solve" "time_solve()"                                              
10 loops, best of 3: 74 msec per loop
mpariente@grcinq-29:tmp$ python -m timeit -s "from solve_speed_test import time_solve" "time_solve()"                                                                   
10 loops, best of 3: 1.42 sec per loop
This was also reproduced on three other machines with different specs.
Numpy/Python version information:
- Python 3.6
- Numpy 1.18.1
Maybe this is a known issue, and it depends a lot on the shape, but we’d like some help with this.
Issue Analytics
- State:
- Created 4 years ago
- Comments:10 (5 by maintainers)

Answers / responses to the below statement are welcome, but certainly not expected. I’d just like to document my experience for anyone else unfortunate enough to stumble here, and maybe get some insight into what the hell is going on haha.
Long story very, very short: a lot of my company’s signal processing seems to end up in these “BLAS” functions called by Numpy or Scipy when trying to solve matrices. We use scipy.signal.firwin and scipy.signal.filtfilt a lot.
Anyway, I was able to see a 24x speedup simply by setting OMP_NUM_THREADS=1. I suspect that OpenBLAS is just trying too hard to parallelize on the relatively small inputs we’re giving it? I don’t have the energy to dig into this deeply, but this seems like a pretty ridiculous experience to have had.
Evidently, this issue has lurked since my company upgraded to Numpy / Scipy versions which bundled OpenBLAS. Fortunately, our overall system’s performance was not dominated by how fast we can solve matrices, but one particular module clearly was. We didn’t have this issue when compiling Numpy / Scipy against ATLAS or libblas and liblapack.
This leaves me with a question: why did the maintainers elect to bundle OpenBLAS with Numpy and Scipy? Is it really efficient for large inputs? Or other circumstances? I suppose I’m just blown away by how much worse it is in our use case. Am I just dumb, and missed a section of the docs on performance tuning?
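For anyone who wants to try the OMP_NUM_THREADS=1 workaround described above without changing their shell setup, here is a minimal sketch that launches a child Python process with the thread count capped. The helper name run_with_thread_cap is hypothetical. OPENBLAS_NUM_THREADS and MKL_NUM_THREADS are set alongside OMP_NUM_THREADS on the assumption that you may not know which BLAS your NumPy build links against. A child process is used because these variables are generally read when the BLAS library is loaded, so they need to be in place before NumPy is imported.

```python
import os
import subprocess
import sys

# Environment variables read by common BLAS/OpenMP runtimes when they load.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def run_with_thread_cap(python_args, num_threads=1):
    """Run `python <python_args>` in a child process with BLAS threads capped.

    Hypothetical helper for illustration; not part of NumPy or SciPy.
    """
    env = dict(os.environ)
    for var in THREAD_ENV_VARS:
        env[var] = str(num_threads)
    return subprocess.run(
        [sys.executable, *python_args],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Show that the child process really sees the capped value.
    result = run_with_thread_cap(
        ["-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
    )
    print(result.stdout.strip())
```

To time the reproducer from the top of this issue under the cap, something like run_with_thread_cap(["-m", "timeit", "-s", "from solve_speed_test import time_solve", "time_solve()"]) should work, assuming solve_speed_test.py is in the current directory.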
You may be interested in the Python package threadpoolctl, which tries to provide a convenient API for such tasks. I am not sure that OMP_NUM_THREADS is the correct way to set threads for MKL.
The lack of speedup when going from --cpu-list 0 to --cpu-list 0,1 might indicate that you have not disabled hyperthreading, so you are effectively running with the same resources in both cases. It seems your sample size is best served on your machine with 4 (effectively 2) CPUs.
Unfortunately yes. Strategies inside software like MKL or OpenBLAS for splitting tasks between CPUs may not be optimal; they need to balance task size, memory bandwidth, CPU caching and resource availability. There are no set interfaces for querying the machine across vendors and operating systems, so libraries are forced to choose generic strategies that don’t always provide optimal results. Some people with high-end machines end up using their own heuristics and OS-level tools like taskset to partition the machine optimally. Tools like Intel’s VTune, AMD’s uProf or Concurrency Visualizer (part of Visual Studio) can help determine when resource contention is slowing down your task. The fact that such tools exist, and that people are willing to pay the license fees to use them, shows this is not a solved problem. Note that GPGPUs also suffer from these types of problems, and GPU vendors also supply tools to analyze where slowdowns occur.
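As a rough illustration of the threadpoolctl suggestion, the sketch below times the same np.linalg.solve workload from the original report with the BLAS thread pool limited to one thread and then with the library default. It assumes threadpoolctl is installed; the function names time_solves and compare_thread_limits are made up for this example, and the numbers will depend heavily on the machine and the BLAS build.

```python
import time

import numpy as np
from threadpoolctl import threadpool_limits


def time_solves(G, D, repeats=10):
    """Return the wall-clock seconds taken by `repeats` calls to np.linalg.solve."""
    start = time.perf_counter()
    for _ in range(repeats):
        np.linalg.solve(G, D)
    return time.perf_counter() - start


def compare_thread_limits(flen=512, limits=(1, None), seed=0):
    """Time the solve workload under each BLAS thread limit.

    Illustrative helper only. A limit of None leaves the library default
    thread count unchanged.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((flen, flen))
    D = rng.standard_normal(flen)
    timings = {}
    for limit in limits:
        # Restrict only the BLAS thread pools for the duration of the block.
        with threadpool_limits(limits=limit, user_api="blas"):
            timings[limit] = time_solves(G, D)
    return timings


if __name__ == "__main__":
    for limit, seconds in compare_thread_limits().items():
        label = "default" if limit is None else f"{limit} thread(s)"
        print(f"{label}: {seconds:.3f} s")
```

Unlike an environment variable, the limit here applies only inside the with block, so it can be wrapped around just the module that suffers from over-parallelization.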