KMeans(init='k-means++') performance issue with OpenBLAS

I am opening this issue to investigate a performance problem that might be related to #17230.

I adapted the reproducer of #17230 to display more info and to make it work on a medium-size random dataset.

from sklearn import cluster
from time import time
from pprint import pprint
from threadpoolctl import threadpool_info
import numpy as np


pprint(threadpool_info())
rng = np.random.RandomState(0)
data = rng.randn(5000, 50)
t0_global = time()
for k in range(1, 15):
    t0 = time()
    # print(f"Running k-means with k={k}: ", end="", flush=True)
    cluster.KMeans(
        n_clusters=k,
        random_state=42,
        n_init=10,
        max_iter=2000,
        algorithm='full',
        init='k-means++').fit(data)
    # print(f"{time() - t0:.3f} s")

print(f"Total duration: {time() - t0_global:.3f} s")

I ran this on Linux with scikit-learn master (therefore including the #16499 fix) with 2 different builds of scipy (with OpenBLAS from PyPI and MKL from Anaconda) and various values for OMP_NUM_THREADS (unset, OMP_NUM_THREADS=1, OMP_NUM_THREADS=2, OMP_NUM_THREADS=4) on a laptop with 2 physical CPU cores (4 logical CPUs).

In both cases, I use the same scikit-learn binaries (built with GCC in editable mode). I just change the env.
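A minimal sketch of how such a sweep over OMP_NUM_THREADS could be scripted, assuming the reproducer above is saved as `kmeans_repro.py` (a hypothetical filename). Each run starts a fresh interpreter because OpenMP and BLAS runtimes typically read the variable when they are first loaded. The helper names `env_for_threads` and `run_sweep` are illustrative and are not part of scikit-learn or threadpoolctl.

```python
import os
import subprocess
import sys


def env_for_threads(n_threads=None, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set, or removed if None."""
    env = dict(os.environ if base_env is None else base_env)
    if n_threads is None:
        # "unset" case: make sure an inherited value does not leak in
        env.pop("OMP_NUM_THREADS", None)
    else:
        env["OMP_NUM_THREADS"] = str(n_threads)
    return env


def run_sweep(script, settings=(None, 1, 2, 4)):
    """Run `script` once per setting, each time in a new Python process."""
    for n_threads in settings:
        label = "unset" if n_threads is None else n_threads
        print(f"--- OMP_NUM_THREADS={label} ---", flush=True)
        subprocess.run(
            [sys.executable, script],
            env=env_for_threads(n_threads),
            check=True,
        )


# Example call (hypothetical script name):
# run_sweep("kmeans_repro.py")
```

Switching between the OpenBLAS and MKL builds still means activating a different environment before calling `run_sweep`.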

The summary is:

  • with MKL there is no problem: large or unset values of OMP_NUM_THREADS are faster than OMP_NUM_THREADS=1;
  • with OpenBLAS, leaving OMP_NUM_THREADS unset or setting it to a large value is significantly slower than a forced sequential run with OMP_NUM_THREADS=1.

I will include my runs in the first comment.

/cc @jeremiedbb

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

jeremiedbb commented, May 25, 2020 (1 reaction)

I think so, but it might be more specific. In k-means++ the matrix multiplication is a (n_candidates, n_features) array times the transpose of a (n_samples, n_features) array, and the number of candidates is small (~log(n_clusters)). It’s possible that MKL is better at optimizing a matrix multiplication where one dimension is much smaller than the other.
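To make the shapes concrete, here is a small NumPy sketch of one candidate-distance step. It is not scikit-learn's code. It assumes the candidate count is `2 + int(log(n_clusters))`, which is my understanding of the default number of local trials, and the function name `candidate_distance_shapes` is made up for illustration.

```python
import math

import numpy as np


def candidate_distance_shapes(n_samples=5000, n_features=50, n_clusters=14, seed=0):
    """Compute squared distances from a few candidates to all samples; return the shapes."""
    rng = np.random.RandomState(seed)
    X = rng.randn(n_samples, n_features)

    # Assumption: a handful of candidates per k-means++ step, growing like log(k)
    n_candidates = 2 + int(math.log(n_clusters))
    candidates = X[rng.choice(n_samples, size=n_candidates, replace=False)]

    # The BLAS call in question: a very "flat" product,
    # (n_candidates, n_features) @ (n_features, n_samples)
    cross = candidates @ X.T

    # ||c - x||^2 = ||c||^2 - 2 c.x + ||x||^2
    d2 = (
        (candidates ** 2).sum(axis=1)[:, None]
        - 2.0 * cross
        + (X ** 2).sum(axis=1)[None, :]
    )
    return candidates.shape, X.T.shape, d2.shape
```

With the defaults above, the left operand has only 4 rows against 5000 columns on the right, so there is little work to split across threads in the small dimension.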

jeremiedbb commented, May 25, 2020 (1 reaction)

Actually, the k-means++ init is at fault! It scales poorly, especially when using hyperthreads. The timings are back to what is expected when you set init to ‘random’, meaning that the OpenMP loop and the inner BLAS calls are properly controlled.

Maybe you can turn this issue into a k-means++ issue?
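One way to check this locally is to time the same fit with both init strategies. The sketch below is only an illustration with hypothetical helper names (`time_init`, `compare_inits`). It leaves `algorithm` at its default rather than the `'full'` value used in the reproducer, so the absolute numbers will differ from the runs in this issue.

```python
from time import perf_counter

import numpy as np
from sklearn.cluster import KMeans


def time_init(init, data, n_clusters=8):
    """Return the wall-clock time of one KMeans fit with the given init strategy."""
    t0 = perf_counter()
    KMeans(n_clusters=n_clusters, init=init, n_init=10, random_state=42).fit(data)
    return perf_counter() - t0


def compare_inits(n_samples=5000, n_features=50, n_clusters=8):
    """Time 'k-means++' against 'random' on the same random dataset."""
    data = np.random.RandomState(0).randn(n_samples, n_features)
    return {
        init: time_init(init, data, n_clusters)
        for init in ("k-means++", "random")
    }
```

Calling `print(compare_inits())` under different OMP_NUM_THREADS values should show whether the slowdown follows the init strategy.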
