`silhouette_score` fails with `"k-means++"` init on very sparse data
Describe the bug
I'm not sure whether this is a bug or something that just needs to be better documented.
On very sparse data, the `"k-means++"` init can select initial centroids that never get updated, and this causes problems in the silhouette clustering evaluation with `sample_size=1000`.
The problem disappears when `sample_size` is increased or when particular random states are set.
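To get a feel for how often a subsample can end up with a single label, here is a rough sketch that is not taken from scikit-learn. It assumes the subsample is drawn without replacement and that all but a handful of points share one label. The function name `prob_single_label` and the sizes used below are hypothetical, not measured on the dataset from this report.

```python
from math import comb


def prob_single_label(n_samples, n_minority, sample_size):
    """Probability that a subsample drawn without replacement contains
    only points from the dominant cluster.

    Assumption: every point outside the dominant cluster counts as
    "minority", so missing all of them leaves a single label.
    """
    # Hypergeometric probability of drawing zero minority points.
    # math.comb returns 0 when sample_size exceeds the majority count.
    return comb(n_samples - n_minority, sample_size) / comb(n_samples, sample_size)


# Hypothetical sizes: 3000 points, 3 of them in singleton clusters.
for size in (1000, 2000, 2900):
    print(size, round(prob_single_label(3000, 3, size), 4))
```

With these made-up sizes, a larger `sample_size` lowers the chance of drawing a single label, which is consistent with the problem disappearing when `sample_size` is increased.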
Steps/Code to Reproduce
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn import metrics

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]
dataset = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"),
    subset="all",
    categories=categories,
    shuffle=True,
    random_state=42,
)
vectorizer = TfidfVectorizer(
    max_df=0.5,
    min_df=2,
    stop_words="english",
)
X = vectorizer.fit_transform(dataset.data)
km = KMeans(
    n_clusters=4,
    init="k-means++",
    max_iter=100,
    n_init=1,
    random_state=0,  # notice the fixed random state
).fit(X)
metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=3)  # works
metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=4)  # fails
Expected Results
Always return silhouette score.
Actual Results
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Untitled-1 in <module>
----> 38 metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=43)

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:117, in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
    115 else:
    116     X, labels = X[indices], labels[indices]
--> 117 return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:231, in silhouette_samples(X, labels, metric, **kwds)
    229 n_samples = len(labels)
    230 label_freqs = np.bincount(labels)
--> 231 check_number_of_labels(len(le.classes_), n_samples)
    233 kwds["metric"] = metric
    234 reduce_func = functools.partial(
    235     _silhouette_reduce, labels=labels, label_freqs=label_freqs
    236 )

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:33, in check_number_of_labels(n_labels, n_samples)
     22 """Check that number of labels are valid.
     23
     24 Parameters
   (...)
     30     Number of samples.
     31 """
     32 if not 1 < n_labels < n_samples:
---> 33     raise ValueError(
     34         "Number of labels is %d. Valid values are 2 to n_samples - 1 (inclusive)"
     35         % n_labels
     36     )

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
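One way to check whether this error comes from an unbalanced clustering is to look at the cluster sizes and at how many distinct labels survive a subsample. The sketch below uses made-up labels rather than the `km.labels_` from the report, and the helper names `cluster_sizes` and `subsample_label_count` are hypothetical. Drawing the subsample as the prefix of a random permutation is an assumption about how the sample is taken, not a copy of the library code.

```python
import numpy as np


def cluster_sizes(labels):
    """Return a {label: count} mapping for an array of cluster labels."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), counts.tolist()))


def subsample_label_count(labels, sample_size, seed):
    """Count the distinct labels left after a random subsample.

    Assumption: the subsample is the first `sample_size` entries of a
    random permutation, i.e. it is drawn without replacement.
    """
    rng = np.random.RandomState(seed)
    indices = rng.permutation(len(labels))[:sample_size]
    return len(np.unique(np.asarray(labels)[indices]))


# Hypothetical labels: one dominant cluster plus three singleton clusters.
labels = np.zeros(3000, dtype=int)
labels[[10, 20, 30]] = [1, 2, 3]

print(cluster_sizes(labels))
for seed in range(5):
    print(seed, subsample_label_count(labels, sample_size=1000, seed=seed))
```

If the count drops to 1 for some seeds, the silhouette score on that subsample is undefined, which matches the `ValueError` above.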
Versions
System:
python: 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:40:17) [GCC 9.4.0]
executable: /home/arturoamor/miniforge3/envs/dev-scikit-learn/bin/python
machine: Linux-5.14.0-1038-oem-x86_64-with-glibc2.31
Python dependencies:
sklearn: 1.2.dev0
pip: 21.3.1
setuptools: 60.5.0
numpy: 1.22.1
scipy: 1.7.3
Cython: 0.29.26
pandas: 1.3.5
matplotlib: 3.5.1
joblib: 1.1.0
threadpoolctl: 3.0.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-2f7c42d4.3.18.so
version: 0.3.18
threading_layer: pthreads
architecture: SkylakeX
num_threads: 8
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so
version: 0.3.17
threading_layer: pthreads
architecture: SkylakeX
num_threads: 8
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/libgomp.so.1.0.0
version: None
num_threads: 8
Top GitHub Comments
I updated it.
Looking at the results for different seeds of `KMeans(n_clusters=4, init="k-means++", max_iter=100, n_init=1, random_state=seed).fit(X)`, this kind of imbalance happens sometimes. I guess it's because k-means++ picks centroids with probability proportional to the squared distance to the closest centroid already chosen, which means it will pick extremely isolated points if there are any, and those then stay their own centroids all along.
Note that it doesn't mean it's a bad clustering. The inertia is close to the inertia obtained with a random init. It's just kind of reflecting the structure of the dataset.
It doesn't look like there's an issue with k-means++. It's just that KMeans (Euclidean distance, in fact) is not really good for such a high-dimensional dataset.
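To illustrate why an isolated point is so likely to be picked as an initial centroid, here is a minimal sketch of the distance-weighted sampling step as it is usually described for k-means++. It is not scikit-learn's code, and the real implementation may differ in details (for example by trying several candidates per step). The function name `d2_sampling_probabilities` and the toy points are hypothetical.

```python
import numpy as np


def d2_sampling_probabilities(points, chosen_centers):
    """Probability of each point being picked as the next center,
    proportional to its squared distance to the closest chosen center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(chosen_centers, dtype=float)
    # Squared Euclidean distance from every point to every chosen center.
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Keep only the distance to the closest center, then normalize.
    closest = sq_dists.min(axis=1)
    return closest / closest.sum()


# Toy data: three points near the origin and one isolated point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 10.0]])
probs = d2_sampling_probabilities(points, points[:1])
print(probs)
```

In this toy setup almost all of the probability mass falls on the isolated point, so it is nearly certain to become a centroid, and with no neighbors to pull it away it can remain a cluster of its own.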