Change KMeans n_init default value to 1.

As mentioned in #9430 (https://github.com/scikit-learn/scikit-learn/issues/9430), pydaal gets a speedup of 30x, mostly because they ignore n_init. I feel like n_init=10 is a pretty odd choice. We don’t really do random restarts anywhere else by default (afaik), and in the age of large datasets this is pretty expensive.

Not sure if it’s worth a deprecation cycle but this is pretty non-obvious behavior that potentially makes us look bad.
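To make the cost concrete, here is a minimal pure-Python sketch of what random restarts do: run the whole fit n_init times from different random initializations and keep the run with the lowest inertia. This is not scikit-learn's implementation; the names `lloyd`, `kmeans_restarts`, `sq_dist`, `mean_point` and the toy `data` are assumptions made up for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def lloyd(points, k, seed, max_iter=100):
    """One k-means fit (Lloyd's algorithm) from a single random initialization."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # "random" init: pick k data points
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            mean_point(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return inertia, centers


def kmeans_restarts(points, k, n_init, seed=0):
    """Run n_init independent fits and keep the one with the lowest inertia."""
    runs = [lloyd(points, k, seed + i) for i in range(n_init)]
    return min(runs, key=lambda run: run[0])


# Toy data (hypothetical): three Gaussian blobs in 2D.
_rng = random.Random(42)
data = [
    (_rng.gauss(cx, 1.0), _rng.gauss(cy, 1.0))
    for cx, cy in [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]
    for _ in range(50)
]

inertia_1, _ = kmeans_restarts(data, k=3, n_init=1)
inertia_10, _ = kmeans_restarts(data, k=3, n_init=10)
print(inertia_1, inertia_10)
```

In this sketch the n_init=10 call does ten full fits, so it costs roughly ten times as much as n_init=1. Because the first restart reuses the same seed as the single run, its result can never be worse, only equal or better.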

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 17 (16 by maintainers)

Top GitHub Comments

2 reactions
thomasjpfan commented, Apr 6, 2022

The example "Empirical evaluation of the impact of k-means initialization" does show that n_init > 1 leads to an improvement for init="random". For init="k-means++", n_init does not make a difference. If we go by the example, we could add n_init="auto", which means n_init=1 for k-means++ and n_init=10 for random.
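The proposed rule can be sketched as a small helper. `resolve_n_init` is a hypothetical name used only for illustration, not a scikit-learn function, and it covers only the two string values of init mentioned above.

```python
def resolve_n_init(init, n_init="auto"):
    """Hypothetical helper: turn n_init="auto" into a concrete restart count.

    Follows the rule proposed in the comment: 1 restart for "k-means++",
    10 restarts for "random". An explicit integer is returned unchanged.
    """
    if n_init != "auto":
        return int(n_init)
    if init == "k-means++":
        return 1
    if init == "random":
        return 10
    raise ValueError(f"no 'auto' rule defined in this sketch for init={init!r}")


print(resolve_n_init("k-means++"))
print(resolve_n_init("random"))
print(resolve_n_init("random", n_init=3))
```

An explicit n_init always takes precedence here, so anyone who wants the old behavior can still ask for 10 restarts with k-means++.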

1 reaction
jnothman commented, Sep 11, 2017

Yes, let’s change it. Or make the default inversely related to dataset size.


Top Results From Across the Web

sklearn.cluster.KMeans — scikit-learn 1.2.0 documentation
Changed in version 1.4: Default value for n_init will change from 10 to 'auto' in version 1.4. max_iter : int, default=300. Maximum number of iterations...

initial centroids for scikit-learn kmeans clustering
Yes, setting initial centroids via init should work. Here's a quote from scikit-learn documentation: init : {'k-means++', 'random' or an ...

KMeans Hyper-parameters Explained with Examples
There are other methods for unsupervised clustering, such as DBScan, ... n_init = By default is 10 and so the algorithm will initialize...

K-Means Optimization & Parameters - HolyPython.com
n_init : (default: 10) Another significant parameter n_init is used to define the number of initialization attempts for centroids of clusters.

In Depth: k-Means Clustering | Python Data Science Handbook
The k-means algorithm searches for a pre-determined number of clusters within an ... as indeed Scikit-Learn does by default (set by the n_init...