KMeans clustering centroids differ between 0.23.2 and 0.24.1
Describe the bug
Clustering data with shape [540, 128] into 24 clusters using KMeans(24, random_state=42)
gives different results under scikit-learn 0.23.2 and 0.24.1.
I don't see anything in the KMeans documentation or in the changelog that could explain the change.
The difference is still there in the nightly builds.
Steps/Code to Reproduce
I have provided a Colab notebook that can be configured with different pip installs. https://colab.research.google.com/drive/1paC6VbTmRJeEEAVFXVxuOVVn6ulvCAml?usp=sharing
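For readers who cannot open the notebook, a minimal sketch of the comparison might look like the following. The original data is not included in this report, so the array here is synthetic (a seeded random matrix with the same shape, [540, 128]); that is an assumption, and the printed values will not match the centroids quoted below. `n_init=10` is also written out explicitly, which the original call did not do, so that the run does not depend on a release's default.

```python
import numpy as np
import sklearn
from sklearn.cluster import KMeans

# Synthetic stand-in for the real data (assumption: the original array is not
# part of the report). Only the shape [540, 128] is taken from the issue.
rng = np.random.RandomState(0)
X = rng.normal(size=(540, 128))

# Same settings as in the report: 24 clusters, random_state=42.
# n_init is spelled out here (assumption) so the default cannot vary by release.
kmeans = KMeans(n_clusters=24, n_init=10, random_state=42).fit(X)

# Print the installed version and the start of the last centroid so the
# output of two environments can be compared side by side.
print("sklearn:", sklearn.__version__)
print(kmeans.cluster_centers_[-1][:5])
```

Running this once per environment (for example after `pip install scikit-learn==0.23.2` and again after `pip install scikit-learn==0.24.1`) and comparing the printed lines should show whether the centroids agree.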
Expected Results
Based on the output from v0.23, the last centroid should start with:
array([ 0.0943853 ...
Actual Results
Under 0.24.1 and later, the last centroid starts with:
array([ 7.83300400e-02, -4.15086746e-02
I know it's not a matter of different ordering; we checked that (see the Colab notebook).
This is having an impact on a downstream task. Is the role of the random_state different between the versions?
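The notebook's own ordering check is not reproduced here. As a rough sketch, one way to test whether two sets of centroids are the same up to a reordering of rows is to match each centroid to its nearest counterpart. The helper name `same_centroids` and the tolerance are hypothetical choices for illustration, and the arrays are synthetic.

```python
import numpy as np

def same_centroids(a, b, atol=1e-6):
    """Return True if a and b contain the same rows up to a permutation."""
    if a.shape != b.shape:
        return False
    # Pairwise Euclidean distances between every row of a and every row of b.
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Each row of a must match a distinct row of b within the tolerance.
    is_permutation = len(set(nearest.tolist())) == len(a)
    return bool(is_permutation and np.all(dists.min(axis=1) <= atol))

# Synthetic centroids with the shape used in the report (24 clusters, 128 features).
centers = np.random.RandomState(0).normal(size=(24, 128))
shuffled = centers[np.random.RandomState(1).permutation(24)]

print(same_centroids(centers, shuffled))        # reordered copy of the same rows
print(same_centroids(centers, centers + 0.01))  # rows with different values
```

If the helper returns False for the `cluster_centers_` arrays from the two versions, the difference is in the values themselves and not in the order of the clusters.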
Versions
Successful:
System:
python: 3.7.10 (default, Feb 20 2021, 21:17:23) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
pip: 19.3.1
setuptools: 56.0.0
sklearn: 0.23.2
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.22
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.0.1
threadpoolctl: 2.1.0
Built with OpenMP: True
Failing:
System:
python: 3.7.10 (default, Feb 20 2021, 21:17:23) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
pip: 19.3.1
setuptools: 56.0.0
sklearn: 0.24.1
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.22
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.0.1
threadpoolctl: 2.1.0
Built with OpenMP: True
(These outputs are from Colab; we also verified the bug in a private Anaconda environment.)
Top GitHub Comments
The change comes from https://github.com/scikit-learn/scikit-learn/pull/17928/. Since that PR, we consume the random_state differently, so only the random number generation is impacted. We forgot to report that in the changelog.
Closing after merging the missing note. Thanks @cmacdonald for reporting.
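To illustrate the explanation above in general terms: two code paths can use the same seed and still see different random numbers if one of them draws from the generator at a different point. The sketch below is not the code changed in the linked PR; `draw_old` and `draw_new` are hypothetical functions that differ only by one extra draw.

```python
import numpy as np

def draw_old(seed, n=5):
    # Hypothetical "old" code path: uses the first n values of the stream.
    rng = np.random.RandomState(seed)
    return rng.random_sample(n)

def draw_new(seed, n=5):
    # Hypothetical "new" code path: one extra draw happens first, which
    # advances the generator before the same n values are requested.
    rng = np.random.RandomState(seed)
    rng.random_sample()
    return rng.random_sample(n)

old = draw_old(42)
new = draw_new(42)

# Same seed, but the values are shifted by one position in the stream.
print(np.allclose(old, new))
print(np.allclose(old[1:], new[:-1]))
```

Because k-means initialization depends on the values drawn, a shift like this is enough to select different starting centroids and to converge to a different solution, even though random_state is unchanged.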