The actual number of clusters returned by MiniBatchKMeans is less than the specified number of clusters
Description
I use MiniBatchKMeans with k=2000, but the number of clusters it returns is 1997, which is less than 2000. When I set k=1950, MiniBatchKMeans returns 1947, again fewer than requested. My dataset has over 16 million samples, and each sample has 150 features.
Steps/Code to Reproduce
Expected Results
When k=2000 is set, 2000 clusters are returned.
Actual Results
1947
Versions
sklearn 0.20.1, system: CentOS 6.3, scipy 1.1.0, numpy 1.14.5
Issue Analytics
- State:
- Created 5 years ago
- Comments:10 (5 by maintainers)
Top GitHub Comments
Here’s a small reproducible example:
although mbk.cluster_centers_ contains no duplicates.

Notice that it’s not that MiniBatchKMeans finds a lower number of clusters, but that at predict time all samples are assigned to fewer clusters than the number of clusters used to fit the data. It might seem surprising given that you fit and predict on the same data, but actually you don’t fit on the same data. When using batches, you might draw the same sample several times and never draw another sample.

Here, I took a batch size of 10, but the default max_iter is 100, so at most 1000 samples have been seen, meaning at least 500 samples have not been seen. Besides the inaccurate number of predicted clusters, this indicates a probably mediocre quality of the solution.
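The commenter's original snippet is not reproduced in this copy of the thread. As a rough stand-in, the sketch below compares the number of fitted centers with the number of distinct labels that predict assigns on the same data. The random data, the helper name count_used_clusters, and the chosen sizes (1500 samples, 100 clusters) are assumptions for illustration, not the commenter's code, and the exact counts will vary with the scikit-learn version and the data.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def count_used_clusters(X, n_clusters, batch_size, max_iter=100, seed=0):
    """Fit MiniBatchKMeans, then return (number of fitted centers,
    number of distinct labels predicted on X).

    Hypothetical helper written for this illustration only.
    """
    mbk = MiniBatchKMeans(
        n_clusters=n_clusters,
        batch_size=batch_size,
        max_iter=max_iter,
        n_init=3,  # set explicitly so the behaviour does not depend on version defaults
        random_state=seed,
    )
    mbk.fit(X)
    labels = mbk.predict(X)
    return mbk.cluster_centers_.shape[0], np.unique(labels).size


# Assumed toy data: 1500 uniformly random 2-D points.
rng = np.random.RandomState(0)
X = rng.rand(1500, 2)

# Compare a very small batch with a larger one.
for batch_size in (10, 500):
    n_centers, n_used = count_used_clusters(X, n_clusters=100, batch_size=batch_size)
    print(f"batch_size={batch_size}: {n_centers} centers, "
          f"{n_used} distinct predicted clusters")
```

The first number always equals n_clusters, since cluster_centers_ has one row per requested cluster. The second can be smaller whenever some center ends up with no sample closest to it.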
The solution would be either to increase your batch size (as you mentioned) or to increase the number of iterations (you’ll probably need to increase max_no_improvement as well in that case).

Yes, you are right. Thank you for helping me solve this problem, thank you again.