`silhouette_score` fails with `"k-means++"` init on very sparse data

Describe the bug

Not sure whether this is a bug or something that just needs to be better documented.

The "k-means++" init on very sparse data can select initial centroids that never get updated, and this causes problems in the silhouette clustering evaluation with sample_size=1000.

The problem disappears when sample_size is increased or when particular random states are set.
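
The dependence on sample_size can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a hypothetical labelling in which all but a few points fall into one cluster (the sizes are invented, not measured on the dataset in this report) and that the subsample is drawn uniformly without replacement. Under those assumptions, the chance that the subsample contains a single label is a hypergeometric probability. `prob_single_label` is an illustrative helper, not part of scikit-learn.

```python
from math import comb


def prob_single_label(n_samples, n_small, sample_size):
    """Probability that a uniform subsample (without replacement) misses
    every one of the `n_small` points lying outside the dominant cluster."""
    if sample_size > n_samples - n_small:
        # The subsample is too large to avoid all of the small clusters.
        return 0.0
    return comb(n_samples - n_small, sample_size) / comb(n_samples, sample_size)


# Hypothetical sizes: 3000 points, 3 of which sit in tiny clusters of their own.
for size in (500, 1000, 2000, 3000):
    p = prob_single_label(n_samples=3000, n_small=3, sample_size=size)
    print(f"sample_size={size:5d}  P(only one label)={p:.3f}")
```

The probability shrinks as sample_size grows and reaches zero once the subsample can no longer avoid the small clusters, which matches the observation that a larger sample_size makes the error go away.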

Steps/Code to Reproduce

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn import metrics

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

dataset = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"),
    subset="all",
    categories=categories,
    shuffle=True,
    random_state=42,
)

vectorizer = TfidfVectorizer(
    max_df=0.5,
    min_df=2,
    stop_words="english",
)

X = vectorizer.fit_transform(dataset.data)

km = KMeans(
    n_clusters=4,
    init="k-means++",
    max_iter=100,
    n_init=1,
    random_state=0,  # note the fixed random state
).fit(X)

metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=3)  # works

metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=4)  # fails

Expected Results

Always return a silhouette score.

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Untitled-1 in <module>
----> 38 metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=43)

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:117, in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
    115     else:
    116         X, labels = X[indices], labels[indices]
--> 117 return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:231, in silhouette_samples(X, labels, metric, **kwds)
    229 n_samples = len(labels)
    230 label_freqs = np.bincount(labels)
--> 231 check_number_of_labels(len(le.classes_), n_samples)
    233 kwds["metric"] = metric
    234 reduce_func = functools.partial(
    235     _silhouette_reduce, labels=labels, label_freqs=label_freqs
    236 )

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:33, in check_number_of_labels(n_labels, n_samples)
     22 """Check that number of labels are valid.
     23 
     24 Parameters
   (...)
     30     Number of samples.
     31 """
     32 if not 1 < n_labels < n_samples:
---> 33     raise ValueError(
     34         "Number of labels is %d. Valid values are 2 to n_samples - 1 (inclusive)"
     35         % n_labels
     36     )

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
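
The error says the subsample that reached silhouette_samples contained a single label. One way to check this before calling the metric is to redraw the subsample and look at which labels survive. The sketch below assumes the indices are the first sample_size entries of a seeded permutation, which appears to be how silhouette_score subsamples but is not shown in the traceback. `labels_in_subsample` and the unbalanced labelling are hypothetical, chosen only for illustration.

```python
import numpy as np


def labels_in_subsample(labels, sample_size, seed):
    """Return the distinct labels that survive a seeded random subsample.

    Assumption: indices are the first `sample_size` entries of a seeded
    permutation of all sample indices.
    """
    labels = np.asarray(labels)
    rng = np.random.RandomState(seed)
    indices = rng.permutation(labels.shape[0])[:sample_size]
    return np.unique(labels[indices])


# Hypothetical labelling: one large cluster plus three singleton clusters.
labels = np.array([0] * 2997 + [1, 2, 3])
for seed in range(5):
    kept = labels_in_subsample(labels, sample_size=1000, seed=seed)
    status = "ok" if len(kept) > 1 else "single label, the metric would raise"
    print(seed, kept, status)
```

On real data, passing km.labels_ together with the same sample_size and random_state used for the metric should show whether a given seed is affected, provided the assumption about the sampling holds for the installed version.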

Versions

System:
    python: 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:40:17)  [GCC 9.4.0]
executable: /home/arturoamor/miniforge3/envs/dev-scikit-learn/bin/python
   machine: Linux-5.14.0-1038-oem-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.2.dev0
          pip: 21.3.1
   setuptools: 60.5.0
        numpy: 1.22.1
        scipy: 1.7.3
       Cython: 0.29.26
       pandas: 1.3.5
   matplotlib: 3.5.1
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-2f7c42d4.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so
        version: 0.3.17
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 8

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/libgomp.so.1.0.0
        version: None
    num_threads: 8

Top GitHub Comments

ArturoAmorQ commented, Jun 8, 2022

Since random_state is not set in KMeans in the reproducer, I can’t reproduce it yet. We need to find a seed that triggers it.

I updated it.

jeremiedbb commented, Jun 8, 2022

Looking at the results for different seeds of KMeans(n_clusters=4, init="k-means++", max_iter=100, n_init=1, random_state=seed).fit(X), this kind of imbalance happens sometimes. I guess it’s because kmeans++ picks centroids with probability proportional to the squared distance to the nearest already chosen centroid, which means it will pick extremely isolated points if there are any, and those then stay their own centroids all along.
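
In the textbook k-means++ seeding rule, each new centroid is drawn with probability proportional to the squared distance to the nearest centroid already chosen, so a single far-away point can dominate the draw. The sketch below illustrates this on made-up one-dimensional data. It is a simplified illustration of the rule, not scikit-learn's implementation, and `kmeanspp_probabilities` is a hypothetical helper.

```python
def kmeanspp_probabilities(points, centroids):
    """Textbook k-means++ rule in 1D: the selection probability of each
    point is proportional to its squared distance to the nearest centroid."""
    d2 = [min((x - c) ** 2 for c in centroids) for x in points]
    total = sum(d2)
    return [d / total for d in d2]


# Made-up data: four points close together and one isolated point.
points = [0.0, 0.1, 0.2, 0.3, 10.0]
probs = kmeanspp_probabilities(points, centroids=[points[0]])
for x, p in zip(points, probs):
    print(f"x={x:5.1f}  p={p:.4f}")
```

Almost all of the probability mass lands on the isolated point, which is consistent with the explanation that such points get picked as initial centroids and then remain clusters of their own.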

Note that this doesn’t mean it’s a bad clustering. The inertia is close to the inertia obtained with a random init. It just reflects the structure of the dataset.

It doesn’t look like there’s an issue with kmeans++. It’s just that KMeans (Euclidean distance, in fact) is not really good for such a high-dimensional dataset.
