Kmeans clustering centroids differ between 0.23.2 and 0.24.1

Describe the bug

Clustering data with shape [540, 128] into 24 clusters using KMeans(24, random_state=42) gives different centroids under scikit-learn 0.23.2 and 0.24.1.

I don't see anything in the KMeans documentation or in the changelog that could explain the change.

The difference is still there in the nightly builds.

Steps/Code to Reproduce

I have provided a Colab notebook that can be configured with different pip installs. https://colab.research.google.com/drive/1paC6VbTmRJeEEAVFXVxuOVVn6ulvCAml?usp=sharing

Expected Results

Based on the output from v0.23, the last centroid should start with array([ 0.0943853 ...

Actual Results

Under 0.24.1 and later, the last centroid starts with array([ 7.83300400e-02, -4.15086746e-02

I know it's not a matter of the centroids being in a different order; we checked that (see the Colab notebook).
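
For readers who want to run a similar ordering check themselves, here is a minimal sketch of an order-insensitive comparison using only the Python standard library. It is not the code from the Colab notebook; the function name `same_centroid_set` and the tolerance value are assumptions chosen for illustration, and centroids are assumed to be plain lists of floats.

```python
import math

def same_centroid_set(a, b, tol=1e-6):
    """Return True if each centroid in `a` has its own match in `b` within `tol`.

    Hypothetical helper: `a` and `b` are lists of centroids, each centroid a
    list of floats. Matching is greedy, which is adequate when centroids are
    well separated relative to `tol`.
    """
    if len(a) != len(b):
        return False
    unmatched = list(b)  # centroids of `b` not yet paired with one from `a`
    for c in a:
        match = next(
            (d for d in unmatched
             if len(c) == len(d)
             and all(math.isclose(x, y, abs_tol=tol) for x, y in zip(c, d))),
            None,
        )
        if match is None:
            return False  # no remaining centroid in `b` is close to `c`
        unmatched.remove(match)
    return True
```

If this returns False for the centroids produced by two versions, the difference is in the centroid values themselves and not only in their order.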

This is having an impact on a downstream task. Is the role of the random_state different between the versions?

Versions

Successful:

System:
    python: 3.7.10 (default, Feb 20 2021, 21:17:23)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 19.3.1
   setuptools: 56.0.0
      sklearn: 0.23.2
        numpy: 1.19.5
        scipy: 1.4.1
       Cython: 0.29.22
       pandas: 1.1.5
   matplotlib: 3.2.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

Failing:

System:
    python: 3.7.10 (default, Feb 20 2021, 21:17:23)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 19.3.1
   setuptools: 56.0.0
      sklearn: 0.24.1
        numpy: 1.19.5
        scipy: 1.4.1
       Cython: 0.29.22
       pandas: 1.1.5
   matplotlib: 3.2.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

(These outputs are from Colab; we also verified the bug in a private Anaconda environment.)

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

2 reactions
jeremiedbb commented, Apr 27, 2021

The change comes from https://github.com/scikit-learn/scikit-learn/pull/17928/. We now consume the random_state differently, so only the random number generation is affected. We forgot to report that in the changelog.
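
To illustrate the general effect described in this comment, here is a minimal sketch using only Python's standard `random` module. It is not scikit-learn's code and does not reproduce what the linked pull request changed; the function `pick_candidates` and its `extra_draws` parameter are hypothetical names invented for this example. It only shows that two routines seeded identically can select different points if one of them consumes the generator differently.

```python
import random

def pick_candidates(seed, n_points=540, n_picks=24, extra_draws=0):
    """Pick `n_picks` point indices from a generator seeded with `seed`.

    Hypothetical stand-in for a seeding routine. `extra_draws` simulates a
    code change that consumes additional random numbers before picking.
    """
    rng = random.Random(seed)
    for _ in range(extra_draws):
        rng.random()  # each extra draw shifts every value that follows
    return [rng.randrange(n_points) for _ in range(n_picks)]

# Same seed, same consumption pattern: identical picks on every run.
old_picks = pick_candidates(42)
# Same seed, but one extra draw is consumed first: the picks differ.
new_picks = pick_candidates(42, extra_draws=1)
```

The seed still makes each variant reproducible on its own; it just does not guarantee identical output across code that draws from the generator in a different pattern. When results must match across versions, one option is to pass explicit initial centers through the `init` parameter of KMeans instead of relying on the seed alone.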

1 reaction
glemaitre commented, Apr 27, 2021

Closing after merging the missing note. Thanks @cmacdonald for reporting.
