Change KMeans n_init default value to 1.

As mentioned in #9430, pydaal gets a speedup of 30x, mostly because they ignore n_init. I feel like n_init=10 is a pretty odd choice. We don’t really do random restarts anywhere else by default (afaik), and in the age of large datasets this is really pretty expensive.

Not sure if it’s worth a deprecation cycle but this is pretty non-obvious behavior that potentially makes us look bad.
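To make the cost argument concrete, here is a minimal pure-Python sketch of k-means with random restarts. It is not scikit-learn's implementation: the helper names (`single_run`, `kmeans_inertia`), the toy data, and the seeds are all assumptions made for illustration. It only shows that `n_init` restarts mean `n_init` full runs, with the lowest-inertia run kept.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))


def single_run(points, k, rng, max_iter=100):
    """One k-means run from a random initialization; returns its inertia."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to its cluster mean
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def kmeans_inertia(points, k, n_init, seed=0):
    """Hypothetical helper: run n_init restarts, keep the best inertia."""
    rng = random.Random(seed)
    inertias = [single_run(points, k, rng) for _ in range(n_init)]
    return min(inertias), len(inertias)


# Toy data (an assumption): three Gaussian blobs of 50 points each.
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in blob_centers for _ in range(50)]

best_1, runs_1 = kmeans_inertia(points, k=3, n_init=1)
best_10, runs_10 = kmeans_inertia(points, k=3, n_init=10)
print(runs_1, best_1)
print(runs_10, best_10)
```

In this sketch both calls start from the same seed, so the ten-restart call repeats the single run as its first restart and can only match or improve on its inertia, at roughly ten times the work.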

Issue Analytics

Created: 6 years ago
Comments: 17 (16 by maintainers)

Top GitHub Comments

2 reactions

thomasjpfan commented, Apr 6, 2022

The example "Empirical evaluation of the impact of k-means initialization" shows that n_init > 1 leads to an improvement for init="random". For init="k-means++", n_init does not make a difference. Going by that example, we could add n_init="auto", where n_init=1 for k-means++ and n_init=10 for random.
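The proposed rule could be sketched roughly as follows. The function name `resolve_n_init` and its error handling are hypothetical, chosen for illustration; this is not scikit-learn's code, only the mapping described in the comment above.

```python
def resolve_n_init(init="k-means++", n_init="auto"):
    """Hypothetical helper: turn n_init="auto" into a concrete restart count.

    Follows the rule proposed in the comment: 1 restart for k-means++,
    10 restarts for random initialization. An explicit integer wins.
    """
    if n_init != "auto":
        # An explicit value from the caller is used as-is.
        if not isinstance(n_init, int) or n_init < 1:
            raise ValueError(f"n_init must be 'auto' or a positive int, got {n_init!r}")
        return n_init
    if init == "k-means++":
        return 1
    if init == "random":
        return 10
    raise ValueError(f"unsupported init for this sketch: {init!r}")


print(resolve_n_init("k-means++"))
print(resolve_n_init("random"))
print(resolve_n_init("random", n_init=3))
```

Keeping an explicit integer authoritative means users who rely on a fixed number of restarts would see no change in behavior; only callers who leave the default in place get the cheaper single run with k-means++.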

1 reaction

jnothman commented, Sep 11, 2017

Yes, let’s change it. Or make the default inversely related to dataset size.

On 12 September 2017 at 00:55, Andreas Mueller wrote:

As mentioned in #9430 https://github.com/scikit-learn/scikit-learn/issues/9430, pydaal gets a speedup of 30x, mostly because they ignore n_init. I feel like n_init=10 is a pretty odd choice. We don’t really do random restarts anywhere else by default (afaik), and in the age of large datasets this is really pretty expensive.

Not sure if it’s worth a deprecation cycle but this is pretty non-obvious behavior that potentially makes us look bad.

Top Results From Across the Web

sklearn.cluster.KMeans — scikit-learn 1.2.0 documentation

Changed in version 1.4: Default value for n_init will change from 10 to 'auto' in version 1.4. max_iter: int, default=300. Maximum number of iterations...

initial centroids for scikit-learn kmeans clustering

Yes, setting initial centroids via init should work. Here's a quote from scikit-learn documentation: init : {'k-means++', 'random' or an ...

KMeans Hyper-parameters Explained with Examples

There are other methods for unsupervised clustering, such as DBScan, ... n_init = By default is 10 and so the algorithm will initialize...

K-Means Optimization & Parameters - HolyPython.com

n_init : (default: 10) Another significant parameter n_init is used to define the number of initialization attempts for centroids of clusters.

In Depth: k-Means Clustering | Python Data Science Handbook

The k-means algorithm searches for a pre-determined number of clusters within an ... as indeed Scikit-Learn does by default (set by the n_init...