KMeans(init='k-means++') performance issue with OpenBLAS
I open this issue to investigate a performance problem that might be related to #17230. I adapted the reproducer of #17230 to display more info and make it work on a medium-size random dataset.
```python
from time import time
from pprint import pprint

import numpy as np
from threadpoolctl import threadpool_info

from sklearn import cluster

pprint(threadpool_info())

rng = np.random.RandomState(0)
data = rng.randn(5000, 50)

t0_global = time()
for k in range(1, 15):
    t0 = time()
    # print(f"Running k-means with k={k}: ", end="", flush=True)
    cluster.KMeans(
        n_clusters=k,
        random_state=42,
        n_init=10,
        max_iter=2000,
        algorithm='full',
        init='k-means++').fit(data)
    # print(f"{time() - t0:.3f} s")
print(f"Total duration: {time() - t0_global:.3f} s")
```
I tried to run this on Linux with scikit-learn master (therefore including the #16499 fix) with 2 different builds of scipy (with OpenBLAS from PyPI and MKL from Anaconda) and various values for `OMP_NUM_THREADS` (unset, `OMP_NUM_THREADS=1`, `OMP_NUM_THREADS=2`, `OMP_NUM_THREADS=4`) on a laptop with 2 physical CPU cores (4 logical CPUs).
In both cases, I use the same scikit-learn binaries (built with GCC in editable mode). I just change the env.
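As a side note, instead of restarting the process with a different `OMP_NUM_THREADS`, the thread pools can also be capped programmatically with threadpoolctl (the same library the reproducer uses for introspection). A minimal sketch, assuming threadpoolctl is installed:

```python
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.RandomState(0).randn(2000, 50)

# Cap every detected thread pool (OpenBLAS / MKL / OpenMP) at one thread
# for the duration of the block -- the same effect as OMP_NUM_THREADS=1,
# but without restarting the process or changing the environment.
with threadpool_limits(limits=1):
    gram = a @ a.T  # this BLAS call runs single-threaded

print(gram.shape)  # (2000, 2000)
```

This makes it easy to time the sequential and parallel cases back to back in a single session.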
The summary is:
- with MKL there is no problem: large or unset values of `OMP_NUM_THREADS` are faster than `OMP_NUM_THREADS=1`;
- with OpenBLAS, leaving `OMP_NUM_THREADS` unset or setting a large value for it is significantly slower than a forced sequential run with `OMP_NUM_THREADS=1`.
I will include my runs in the first comment.
/cc @jeremiedbb
Issue Analytics
- State:
- Created 3 years ago
- Comments: 11 (11 by maintainers)
Top GitHub Comments
I think so but it might be more specific. In k-means++ the matrix multiplication is `(n_candidates, n_features) x (n_samples, n_features)` and the number of candidates is a small number (~`log(n_clusters)`). It's possible that MKL is better at optimizing a matrix multiplication where one dimension is much smaller than the other one.

Actually it's the k-means++ init's fault! It scales poorly, especially when using hyperthreads. The timings are back to expected when you set the init to 'random', meaning that the OpenMP loop and the inner BLAS calls are properly controlled.

Maybe you can turn this issue into a k-means++ issue?
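To make the shape argument concrete, here is an illustrative sketch (not scikit-learn's actual implementation) of the skinny GEMM that dominates a k-means++ step; the candidate count of roughly `2 + log(n_clusters)` is an assumption matching the description above:

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features, n_clusters = 5000, 50, 14
# k-means++ only evaluates a handful of candidate centers per step.
n_candidates = 2 + int(np.log(n_clusters))  # = 4 here

X = rng.randn(n_samples, n_features)
candidates = X[rng.choice(n_samples, n_candidates, replace=False)]

# Squared Euclidean distances expanded as ||c||^2 - 2 c.x + ||x||^2;
# the GEMM below has one very small dimension (n_candidates), which is
# exactly the case where BLAS implementations differ most in how well
# their threading pays off.
cross = candidates @ X.T  # (n_candidates, n_samples), the skinny GEMM
d2 = ((candidates ** 2).sum(axis=1)[:, None]
      - 2 * cross
      + (X ** 2).sum(axis=1)[None, :])
print(d2.shape)  # (4, 5000)
```

With such a small first dimension there is little work per row to split across threads, so an aggressive thread pool can easily lose more to synchronization overhead than it gains.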