Kmeans clustering centroids differ between 0.23.2 and 0.24.1

Describe the bug

Clustering data with shape [540, 128] into 24 clusters using KMeans(24, random_state=42) gives different results in scikit-learn 0.23.2 and 0.24.1.

I don't see anything in the KMeans documentation or in the changelog that could explain the change.

The difference is still there in the nightly builds.

Steps/Code to Reproduce

I have provided a Colab notebook that can be configured with different pip installs. https://colab.research.google.com/drive/1paC6VbTmRJeEEAVFXVxuOVVn6ulvCAml?usp=sharing
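
The notebook's data is not reproduced here, so the following is only a minimal sketch of the comparison. It uses synthetic data of the same shape as a stand-in (an assumption, not the reporter's data). Run it once under each installed version and compare the printed values.

```python
import numpy as np
import sklearn
from sklearn.cluster import KMeans

# Stand-in data (assumption): random values with the shape from the report.
rng = np.random.default_rng(0)
X = rng.normal(size=(540, 128))

# n_init=10 is spelled out because it was the default in 0.23 and 0.24;
# newer releases may use a different default.
km = KMeans(n_clusters=24, random_state=42, n_init=10).fit(X)
centroids = km.cluster_centers_

print("scikit-learn", sklearn.__version__)
print("last centroid starts with", centroids[-1][:2])
```

Printing the version next to the centroid makes it easier to tell the runs apart when the outputs are pasted side by side.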

Expected Results

Based on the output from v0.23, the last centroid should start with array([ 0.0943853 ...

Actual Results

Under 0.24.1 and later, the last centroid starts with array([ 7.83300400e-02, -4.15086746e-02

I know it's not a matter of different ordering; we checked that (see the Colab notebook).
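
The notebook's check is not shown here. One way to rule out a pure reordering, sketched under the assumption that both centroid sets are available as NumPy arrays, is to match each centroid to its nearest counterpart and require a one-to-one match within a tolerance. The helper name and the tolerance are hypothetical.

```python
import numpy as np

def same_centroids_up_to_order(a, b, atol=1e-6):
    """Return True if the rows of `a` and `b` match up to a permutation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    # Pairwise Euclidean distances between centroids, shape (k, k).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Every centroid in `a` must map to a distinct centroid in `b` ...
    is_permutation = len(set(nearest.tolist())) == len(a)
    # ... and each matched pair must agree within the tolerance.
    return bool(is_permutation and np.all(dists.min(axis=1) <= atol))

# Illustrative data only: a shuffled copy matches, a shifted copy does not.
rng = np.random.default_rng(0)
centroids_old = rng.normal(size=(24, 128))
shuffled = centroids_old[rng.permutation(24)]
shifted = centroids_old + 0.05

print(same_centroids_up_to_order(centroids_old, shuffled))
print(same_centroids_up_to_order(centroids_old, shifted))
```

If this returns False for the centroids from the two versions, the difference is in the centroid values themselves and not in the order in which the clusters are listed.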

This is having an impact on a downstream task. Is the role of the random_state different between the versions?

Versions

Successful:

System:
    python: 3.7.10 (default, Feb 20 2021, 21:17:23)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 19.3.1
   setuptools: 56.0.0
      sklearn: 0.23.2
        numpy: 1.19.5
        scipy: 1.4.1
       Cython: 0.29.22
       pandas: 1.1.5
   matplotlib: 3.2.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

Failing:

System:
    python: 3.7.10 (default, Feb 20 2021, 21:17:23)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 19.3.1
   setuptools: 56.0.0
      sklearn: 0.24.1
        numpy: 1.19.5
        scipy: 1.4.1
       Cython: 0.29.22
       pandas: 1.1.5
   matplotlib: 3.2.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

(These outputs are from Colab; we also verified the bug in a private Anaconda environment.)

Top GitHub Comments

jeremiedbb commented, Apr 27, 2021

The change comes from https://github.com/scikit-learn/scikit-learn/pull/17928/. We now consume the random_state differently, so only the random number generation is affected. We forgot to mention this in the changelog.
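
For a downstream task that needs the seeding step to stay the same across versions, one option is to stop relying on how KMeans consumes random_state and pass explicit starting centroids instead. This was not suggested in the thread; it is a sketch under stated assumptions. The data is a synthetic stand-in, and the starting centroids are plain randomly chosen rows, not k-means++ seeds, so the clustering quality may differ from the default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data (assumption): random values with the shape from the report.
rng = np.random.default_rng(0)
X = rng.normal(size=(540, 128))

# Choose the starting centroids ourselves with a NumPy generator, so the
# seeding no longer depends on how KMeans draws from random_state.
init_idx = np.random.default_rng(42).choice(len(X), size=24, replace=False)
init_centers = X[init_idx]

# With an explicit init array only a single run is meaningful, hence n_init=1.
km = KMeans(n_clusters=24, init=init_centers, n_init=1).fit(X)
pinned_centroids = km.cluster_centers_
print(pinned_centroids.shape)
```

This pins only the initialization. Other numerical changes between versions could still produce small differences in the final centroids.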

glemaitre commented, Apr 27, 2021

Closing after merging the missing note. Thanks @cmacdonald for reporting.
