
The actual number of clusters returned by MiniBatchKMeans is less than the specified number of clusters

See original GitHub issue

Description

I use MiniBatchKMeans and set k=2000, but the number of clusters it returns is 1997, which is less than 2000. I then set k=1950, and MiniBatchKMeans returned 1947, again fewer than requested. My dataset has over 16 million samples, each with 150 features.

Steps/Code to Reproduce
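The original issue left this section blank; below is a minimal sketch of how the behavior might be observed, with a synthetic make_blobs dataset standing in for the real 16-million-sample data (all sizes and parameters here are illustrative assumptions, not the reporter’s actual code):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

# synthetic stand-in for the real dataset (sizes chosen so this runs quickly)
X, _ = make_blobs(n_samples=100000, n_features=150, random_state=0)

# init_size must be at least n_clusters; set it explicitly to avoid a warning
mbk = MiniBatchKMeans(n_clusters=2000, init_size=6000, random_state=0)
mbk.fit(X)
labels = mbk.predict(X)

# often prints fewer than 2000 distinct labels, matching the report
print(len(np.unique(labels)))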

Expected Results

When k is set to 2000, 2000 clusters should be returned.

Actual Results

1947 (for k=1950; similarly, 1997 for k=2000)

Versions

scikit-learn 0.20.1
system: CentOS 6.3
scipy 1.1.0
numpy 1.14.5


Top GitHub Comments

jeremiedbb commented, Mar 20, 2019

Here’s a small reproducible example:

from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

X, _ = make_blobs(n_samples=1500, n_features=10, random_state=10)
mbk = MiniBatchKMeans(n_clusters=20, batch_size=10, random_state=0)

mbk.fit(X)
y_pred = mbk.predict(X)

len(set(y_pred))
>>> 19

even though mbk.cluster_centers_ contains no duplicates.

Notice that it’s not that MiniBatchKMeans finds a lower number of clusters; rather, at predict time all samples are assigned to fewer clusters than the number of clusters used to fit the data. It might seem surprising given that you fit and predict on the same data, but in effect you don’t fit on the same data: when using batches, the same samples may be drawn several times while other samples are never drawn at all.
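One way to see this concretely (a sketch building on the example above, not part of the original comment) is to count how many samples land in each fitted cluster at predict time:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

X, _ = make_blobs(n_samples=1500, n_features=10, random_state=10)
mbk = MiniBatchKMeans(n_clusters=20, batch_size=10, random_state=0).fit(X)
y_pred = mbk.predict(X)

# all 20 fitted centers are distinct, yet one of them is no sample's
# nearest center, so it never appears among the predicted labels
counts = np.bincount(y_pred, minlength=mbk.n_clusters)
print(np.where(counts == 0)[0])  # index of the "empty" cluster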

Here, I took a batch size of 10, and the default max_iter is 100, so at most 1000 samples have been seen, meaning at least 500 of the 1500 samples were never used during fitting.
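To make that arithmetic concrete, here is a back-of-the-envelope simulation (my own sketch, not code from the thread) of drawing 100 batches of 10 with replacement from 1500 samples:

import numpy as np

rng = np.random.RandomState(0)
n_samples, batch_size, n_batches = 1500, 10, 100

# each mini-batch draws indices with replacement, so the same sample
# can be picked repeatedly while others are never picked at all
seen = np.zeros(n_samples, dtype=bool)
for _ in range(n_batches):
    seen[rng.randint(0, n_samples, batch_size)] = True

# expected fraction ever drawn: 1 - (1 - 1/1500)**1000, about 0.49
print(seen.sum())  # around 730 of the 1500 samples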

Besides the inaccurate number of predicted clusters, this indicates that the quality of the solution is probably mediocre.

The solution is either to increase your batch size (as you mentioned) or to increase the number of iterations (you’ll probably need to increase max_no_improvement as well in that case).
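For instance, something along these lines, where the parameter values are illustrative rather than a recommendation from the thread:

from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

X, _ = make_blobs(n_samples=1500, n_features=10, random_state=10)

# larger batches plus more iterations give every sample a chance to be
# seen during fitting; raising max_no_improvement keeps early stopping
# from cutting the run short
mbk = MiniBatchKMeans(n_clusters=20, batch_size=100, max_iter=1000,
                      max_no_improvement=100, random_state=0)
y_pred = mbk.fit(X).predict(X)

print(len(set(y_pred)))  # now typically all 20 clusters are used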

tz28 commented, Mar 20, 2019


Yes, you are right. Thank you for helping me solve this problem, and thank you again.
