
The actual number of clusters returned by MiniBatchKMeans is less than the specified number of clusters

See original GitHub issue

Description

I use MiniBatchKMeans and set k=2000, but the number of clusters it returns is 1997, which is less than 2000. I then set k=1950, and MiniBatchKMeans returned 1947, again fewer than requested. My dataset has over 16 million samples, each with 150 features.

Steps/Code to Reproduce
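The original issue left this section blank; below is a minimal sketch of how the behavior might be observed, with a synthetic make_blobs dataset standing in for the real 16-million-sample data (all sizes and parameters here are illustrative assumptions, not the reporter’s actual code):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

# synthetic stand-in for the real dataset (sizes chosen so this runs quickly)
X, _ = make_blobs(n_samples=100000, n_features=150, random_state=0)

# init_size must be at least n_clusters; set it explicitly to avoid a warning
mbk = MiniBatchKMeans(n_clusters=2000, init_size=6000, random_state=0)
mbk.fit(X)
labels = mbk.predict(X)

# often prints fewer than 2000 distinct labels, matching the report
print(len(np.unique(labels)))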

Expected Results

When k is set to 2000, 2000 clusters should be returned.

Actual Results

1947 (for k=1950; similarly, 1997 for k=2000)

Versions

scikit-learn 0.20.1
system: CentOS 6.3
scipy 1.1.0
numpy 1.14.5


Top GitHub Comments

jeremiedbb commented, Mar 20, 2019

Here’s a small reproducible example:

from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

X, _ = make_blobs(n_samples=1500, n_features=10, random_state=10)
mbk = MiniBatchKMeans(n_clusters=20, batch_size=10, random_state=0)

mbk.fit(X)
y_pred = mbk.predict(X)

len(set(y_pred))
>>> 19

even though mbk.cluster_centers_ contains no duplicates.

Notice that it’s not that MiniBatchKMeans finds a lower number of clusters; rather, at predict time all samples are assigned to fewer clusters than the number of clusters used to fit the data. It might seem surprising given that you fit and predict on the same data, but in effect you don’t fit on the same data: when using batches, the same samples may be drawn several times while other samples are never drawn at all.
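One way to see this concretely (a sketch building on the example above, not part of the original comment) is to count how many samples land in each fitted cluster at predict time:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

X, _ = make_blobs(n_samples=1500, n_features=10, random_state=10)
mbk = MiniBatchKMeans(n_clusters=20, batch_size=10, random_state=0).fit(X)
y_pred = mbk.predict(X)

# all 20 fitted centers are distinct, yet one of them is no sample's
# nearest center, so it never appears among the predicted labels
counts = np.bincount(y_pred, minlength=mbk.n_clusters)
print(np.where(counts == 0)[0])  # index of the "empty" cluster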

Here, I took a batch size of 10, and the default max_iter is 100, so at most 1000 samples have been seen, meaning at least 500 of the 1500 samples were never used during fitting.
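To make that arithmetic concrete, here is a back-of-the-envelope simulation (my own sketch, not code from the thread) of drawing 100 batches of 10 with replacement from 1500 samples:

import numpy as np

rng = np.random.RandomState(0)
n_samples, batch_size, n_batches = 1500, 10, 100

# each mini-batch draws indices with replacement, so the same sample
# can be picked repeatedly while others are never picked at all
seen = np.zeros(n_samples, dtype=bool)
for _ in range(n_batches):
    seen[rng.randint(0, n_samples, batch_size)] = True

# expected fraction ever drawn: 1 - (1 - 1/1500)**1000, about 0.49
print(seen.sum())  # around 730 of the 1500 samples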

Besides the inaccurate number of predicted clusters, this indicates that the quality of the solution is probably mediocre.

The solution is either to increase your batch size (as you mentioned) or to increase the number of iterations (you’ll probably need to increase max_no_improvement as well in that case).
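For instance, something along these lines, where the parameter values are illustrative rather than a recommendation from the thread:

from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

X, _ = make_blobs(n_samples=1500, n_features=10, random_state=10)

# larger batches plus more iterations give every sample a chance to be
# seen during fitting; raising max_no_improvement keeps early stopping
# from cutting the run short
mbk = MiniBatchKMeans(n_clusters=20, batch_size=100, max_iter=1000,
                      max_no_improvement=100, random_state=0)
y_pred = mbk.fit(X).predict(X)

print(len(set(y_pred)))  # now typically all 20 clusters are used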

tz28 commented, Mar 20, 2019


Yes, you are right. Thank you for helping me solve this problem, and thank you again.
