KMeans significantly slower on 0.23
Describe the bug
With the latest changes, KMeans is significantly slower on small datasets. The time needed to compute clusters is around ten times longer.
Steps/Code to Reproduce
Times with the following code are: scikit-learn 0.22: ~0.015 s; scikit-learn 0.23: ~0.15 s
import time
import sklearn.cluster
from sklearn import datasets
data = datasets.load_iris()['data']
t = time.time()
sklearn.cluster.KMeans(n_clusters=2).fit(data)
print(time.time() - t)
I also tried a bigger dataset with shape (300, 25), where clustering with the new version took 3-4 s, while before it finished in milliseconds.
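A single time.time() measurement can be noisy, so when comparing the two versions it may help to repeat the fit several times and report the best and median times. The following is only a minimal sketch under that assumption: time_call and benchmark_kmeans are hypothetical helper names introduced here for illustration, not part of scikit-learn, and the repeat count is arbitrary.

```python
import statistics
import time


def time_call(func, repeats=10, warmup=1):
    """Return (best, median) wall-clock seconds over `repeats` calls of func()."""
    for _ in range(warmup):
        func()  # warm-up run, not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return min(samples), statistics.median(samples)


def benchmark_kmeans(repeats=10):
    """Time KMeans(n_clusters=2).fit on iris, as in the report above."""
    # Imported lazily so time_call stays usable without scikit-learn installed.
    import sklearn
    import sklearn.cluster
    from sklearn import datasets

    data = datasets.load_iris()["data"]
    best, median = time_call(
        lambda: sklearn.cluster.KMeans(n_clusters=2).fit(data),
        repeats=repeats,
    )
    print(f"scikit-learn {sklearn.__version__}: "
          f"best {best:.4f} s, median {median:.4f} s")
    return best, median
```

Calling benchmark_kmeans() once in each environment (0.22 and 0.23) gives figures that can be compared directly; the warm-up call keeps one-time setup costs out of the timed runs.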
Expected Results
Clusters would be computed as fast as before.
Versions
System:
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:05:27) [Clang 9.0.1 ]
executable: /Users/primoz/miniconda3/envs/orange/bin/python
machine: Darwin-19.0.0-x86_64-i386-64bit
Python dependencies:
pip: 20.1
setuptools: 46.1.3
sklearn: 0.23.0
numpy: 1.18.4
scipy: 1.4.1
Cython: None
pandas: 1.0.3
matplotlib: 3.2.1
joblib: 0.14.1
Built with OpenMP: True
Issue Analytics
- State:
- Created 3 years ago
- Comments:15 (11 by maintainers)
Top GitHub Comments
@jeremiedbb thank you for your help. I tested the PR and it now works normally.
Thanks @PrimozGodec ! Closing.