question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

KMeans Elkan algorithm (the default) is generally slower than Full

See original GitHub issue

Describe the bug

I just ran some tests on KMeans, and using algorithm="elkan" is generally slower (and sometimes much slower) than algorithm="full" on various datasets generated using make_blobs() (I tried different n_samples, n_features, cluster_std, and different numbers of clusters). I basically couldn’t find a dataset where Elkan was reliably faster than Full. I also tried both NumPy 1.18.5 and 1.19.5, with Scikit-Learn 1.0.1, but Elkan remained generally slower than Full.

Steps/Code to Reproduce

The experiments were done on Colab, like this:

from time import time

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

def time_kmeans(X, n_clusters, algorithm, random_state):
  kmeans = KMeans(n_clusters=n_clusters, algorithm=algorithm, random_state=random_state)
  t0 = time()
  kmeans.fit(X)
  return time() - t0

iteration = 0
for n_samples in (10**3, 10**4, 10**5):
  for n_features in (2, 10, 100):
    for n_clusters in (2, 5, 10, 20, 50, 100):
      for cluster_std in (1e-3, 0.1, 1, 5):
        n_samples_per_cluster = n_samples // n_clusters
        X, y = make_blobs(n_samples=[n_samples_per_cluster] * n_clusters,
                          n_features=n_features,
                          cluster_std=cluster_std)
        iteration += 1
        time_elkan = time_kmeans(X, n_clusters, algorithm="elkan", random_state=iteration)
        time_full = time_kmeans(X, n_clusters, algorithm="full", random_state=iteration)
        winner = "Full" if time_full < time_elkan else "Elkan"
        print(
            f"n_samples={n_samples}, n_features={n_features}, "
            f"n_clusters={n_clusters}, cluster_std={cluster_std}, "
            f"time_elkan={time_elkan:.2f}, time_full={time_full:.2f}, "
            f"winner={winner}")

And the output is:

n_samples=1000, n_features=2, n_clusters=2, cluster_std=0.001, time_elkan=0.01, time_full=0.01, winner=Full
n_samples=1000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=0.11, time_full=0.07, winner=Full
n_samples=1000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=0.19, time_full=0.15, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=0.49, time_full=0.31, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=0.73, time_full=0.29, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=1.17, time_full=0.32, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=1.21, time_full=0.30, winner=Full
n_samples=1000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.12, time_full=0.12, winner=Full
n_samples=1000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.09, time_full=0.08, winner=Full
n_samples=1000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.13, time_full=0.23, winner=Elkan
n_samples=1000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=0.49, time_full=0.63, winner=Elkan
n_samples=1000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.48, time_full=0.24, winner=Full
n_samples=1000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=0.48, time_full=0.26, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=0.75, time_full=0.42, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=0.61, time_full=0.43, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=0.88, time_full=0.53, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=1.36, time_full=0.46, winner=Full
n_samples=1000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.08, time_full=0.15, winner=Elkan
n_samples=1000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.16, time_full=0.07, winner=Full
n_samples=1000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.27, time_full=0.28, winner=Elkan
n_samples=1000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.21, time_full=0.19, winner=Full
n_samples=1000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=0.25, time_full=0.26, winner=Elkan
n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=0.76, time_full=0.61, winner=Full
n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.43, winner=Full
n_samples=1000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=0.59, time_full=0.61, winner=Elkan
n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=0.93, time_full=0.77, winner=Full
n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=0.84, time_full=0.96, winner=Elkan
n_samples=1000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=1.37, time_full=0.94, winner=Full
n_samples=1000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=1.31, time_full=0.79, winner=Full
n_samples=10000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=0.14, time_full=0.27, winner=Elkan
n_samples=10000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=0.31, time_full=0.18, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.29, time_full=0.16, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=0.53, time_full=0.21, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=1.27, time_full=0.83, winner=Full
n_samples=10000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=0.25, time_full=0.27, winner=Elkan
n_samples=10000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=1.17, time_full=0.58, winner=Full
n_samples=10000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=1.60, time_full=0.86, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=0.79, time_full=0.84, winner=Elkan
n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=0.82, time_full=0.70, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=1.99, time_full=1.34, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=2.35, time_full=1.47, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=1.12, time_full=1.06, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=1.34, time_full=1.08, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=7.00, time_full=1.95, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=7.06, time_full=1.92, winner=Full
n_samples=10000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.13, time_full=0.24, winner=Elkan
n_samples=10000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=0.40, time_full=0.32, winner=Full
n_samples=10000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.30, time_full=0.17, winner=Full
n_samples=10000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=0.70, time_full=0.43, winner=Full
n_samples=10000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.46, time_full=0.27, winner=Full
n_samples=10000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=0.48, time_full=0.51, winner=Elkan
n_samples=10000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=1.39, time_full=1.01, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=0.78, time_full=0.59, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.57, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=1.00, time_full=0.90, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=3.47, time_full=2.15, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=1.29, time_full=1.15, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=1.45, time_full=1.36, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=2.58, time_full=1.63, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=11.02, time_full=3.39, winner=Full
n_samples=10000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=0.34, time_full=0.22, winner=Full
n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.34, time_full=0.37, winner=Elkan
n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=0.61, time_full=0.33, winner=Full
n_samples=10000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.50, time_full=0.56, winner=Elkan
n_samples=10000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=0.59, time_full=0.47, winner=Full
n_samples=10000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.63, time_full=0.53, winner=Full
n_samples=10000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=1.02, time_full=1.10, winner=Elkan
n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=0.82, time_full=0.81, winner=Full
n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.82, time_full=0.89, winner=Elkan
n_samples=10000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=0.92, time_full=0.78, winner=Full
n_samples=10000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=1.77, time_full=1.56, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=2.10, time_full=1.71, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=1.93, time_full=1.82, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=2.10, time_full=1.77, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=5, time_elkan=2.89, time_full=2.49, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=4.13, time_full=3.24, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=3.97, time_full=3.15, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=4.20, time_full=3.46, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=5.92, time_full=4.10, winner=Full
n_samples=100000, n_features=2, n_clusters=2, cluster_std=0.1, time_elkan=0.30, time_full=0.38, winner=Elkan
n_samples=100000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=1.31, time_full=1.18, winner=Full
n_samples=100000, n_features=2, n_clusters=5, cluster_std=0.1, time_elkan=0.46, time_full=0.47, winner=Elkan
n_samples=100000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=1.16, time_full=0.63, winner=Full
n_samples=100000, n_features=2, n_clusters=5, cluster_std=5, time_elkan=2.77, time_full=1.77, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.97, time_full=0.71, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.1, time_elkan=0.80, time_full=0.73, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=2.18, time_full=1.16, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=4.82, time_full=2.35, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.001, time_elkan=1.36, time_full=1.23, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=1.66, time_full=1.44, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=4.42, time_full=2.82, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=11.02, time_full=4.66, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=3.61, time_full=3.55, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=3.89, time_full=3.49, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=18.86, time_full=9.43, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=27.77, time_full=12.31, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=8.33, time_full=7.79, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=9.55, time_full=8.30, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=59.08, time_full=20.13, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=70.57, time_full=23.11, winner=Full
n_samples=100000, n_features=10, n_clusters=2, cluster_std=0.1, time_elkan=0.51, time_full=0.42, winner=Full
n_samples=100000, n_features=10, n_clusters=2, cluster_std=1, time_elkan=0.44, time_full=0.61, winner=Elkan
n_samples=100000, n_features=10, n_clusters=2, cluster_std=5, time_elkan=0.85, time_full=0.64, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.55, time_full=0.63, winner=Elkan
n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.77, time_full=0.58, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=1, time_elkan=0.80, time_full=0.61, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=1.70, time_full=1.00, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.001, time_elkan=1.05, time_full=0.87, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=1.06, time_full=0.80, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=1, time_elkan=1.23, time_full=1.14, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=3.28, time_full=2.13, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.001, time_elkan=1.96, time_full=1.55, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=2.17, time_full=2.00, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=2.31, time_full=1.86, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=6.73, time_full=4.03, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=4.87, time_full=4.19, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=4.92, time_full=4.53, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=6.05, time_full=5.50, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=35.26, time_full=18.50, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=10.04, time_full=9.49, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=10.54, time_full=9.01, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=16.26, time_full=12.68, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=140.66, time_full=44.06, winner=Full
n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.001, time_elkan=1.19, time_full=1.25, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.1, time_elkan=1.45, time_full=1.49, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=1.47, time_full=1.49, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=5, time_elkan=1.70, time_full=1.55, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=2.12, time_full=1.95, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=2.24, time_full=2.16, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=1, time_elkan=2.19, time_full=2.06, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=2.74, time_full=2.13, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.001, time_elkan=3.53, time_full=3.39, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=3.61, time_full=3.42, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=3.67, time_full=3.42, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=5.12, time_full=4.50, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=6.02, time_full=5.40, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=6.02, time_full=5.48, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=6.07, time_full=5.51, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=16.14, time_full=17.33, winner=Elkan
n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=14.73, time_full=13.29, winner=Full
n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=14.86, time_full=13.29, winner=Full
...

Expected Results

Elkan should be generally faster, especially on large datasets with many well-defined clusters.

Actual Results

image

Versions

System:
    python: 3.7.12 (default, Sep 10 2021, 00:21:48)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 21.1.3
   setuptools: 57.4.0
      sklearn: 1.0.1
        numpy: 1.19.5
        scipy: 1.4.1
       Cython: 0.29.24
       pandas: 1.1.5
   matplotlib: 3.2.2
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

Note: I first mentioned these results in https://github.com/scikit-learn/scikit-learn/issues/20642, but it turned out to be a different issue which is why I’m copying these results in a new issue.

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Comments:11 (10 by maintainers)

github_iconTop GitHub Comments

1reaction
kno10commented, Aug 29, 2022

Improved benchmark, with real data (color histograms, c.f., https://doi.org/10.5281/zenodo.6355684 ), but still effectively only a single data set:

import pandas as pd
from time import time
from sklearn.cluster import KMeans

def time_kmeans(X, n_clusters, algorithm, random_state):
  kmeans = KMeans(n_clusters=n_clusters, algorithm=algorithm, random_state=random_state, tol=0, max_iter=1000)
  t0 = time()
  kmeans.fit(X)
  return time() - t0, kmeans.n_iter_

iteration = 0
for d in (8, 27, 64, 125, 216, 343, 512, 729, 1000):
  df = pd.read_csv("https://zenodo.org/record/6355684/files/aloi-%dd.csv.gz" % d, delimiter=" ", header=None)
  X=df.iloc[:,:-2].values
  assert X.shape == (110250, d), Xshape
  for k in (2, 5, 10, 20, 50, 100):
    iteration += 1
    time_elkan, iter_elkan = time_kmeans(X, k, algorithm="elkan", random_state=iteration)
    time_full, iter_full = time_kmeans(X, k, algorithm="full", random_state=iteration)
    winner = "Full" if time_full < time_elkan else "Elkan"
    print(
        f"n_iter={iter_elkan:-4}, n_iter={iter_full:-4}, "
        f"k={k:-4}, d={d:-4}, "
        f"time_elkan={time_elkan:-6.2f}, time_full={time_full:-6.2f}, "
        f"winner={winner}")

on colab, this gives (still running for the larger data sets)

n_iter=  16, n_iter=  16, k=   2, d=   8, time_elkan=  1.12, time_full=  1.13, winner=Elkan
n_iter=  91, n_iter=  91, k=   5, d=   8, time_elkan=  2.60, time_full=  2.35, winner=Full
n_iter=  78, n_iter=  78, k=  10, d=   8, time_elkan=  4.99, time_full=  3.84, winner=Full
n_iter=  80, n_iter=  80, k=  20, d=   8, time_elkan=  9.55, time_full=  6.26, winner=Full
n_iter= 160, n_iter= 160, k=  50, d=   8, time_elkan= 20.37, time_full= 12.33, winner=Full
n_iter= 101, n_iter= 101, k= 100, d=   8, time_elkan= 71.28, time_full= 26.39, winner=Full
n_iter=  21, n_iter=  21, k=   2, d=  27, time_elkan=  1.69, time_full=  2.95, winner=Elkan
n_iter=  55, n_iter=  55, k=   5, d=  27, time_elkan=  3.27, time_full=  4.12, winner=Elkan
n_iter=  73, n_iter=  73, k=  10, d=  27, time_elkan=  6.85, time_full=  7.89, winner=Elkan
n_iter=  81, n_iter=  81, k=  20, d=  27, time_elkan= 14.78, time_full= 14.79, winner=Elkan
n_iter= 199, n_iter= 199, k=  50, d=  27, time_elkan= 25.07, time_full= 22.89, winner=Full
n_iter= 132, n_iter= 132, k= 100, d=  27, time_elkan= 92.91, time_full= 48.75, winner=Full
n_iter=  20, n_iter=  20, k=   2, d=  64, time_elkan=  2.25, time_full=  3.52, winner=Elkan
n_iter=  40, n_iter=  40, k=   5, d=  64, time_elkan=  6.51, time_full= 11.48, winner=Elkan
n_iter=  88, n_iter=  88, k=  10, d=  64, time_elkan=  9.45, time_full= 14.38, winner=Elkan
n_iter= 125, n_iter= 125, k=  20, d=  64, time_elkan= 17.72, time_full= 22.30, winner=Elkan
n_iter= 114, n_iter= 114, k=  50, d=  64, time_elkan= 34.08, time_full= 43.57, winner=Elkan
n_iter= 134, n_iter= 134, k= 100, d=  64, time_elkan=105.27, time_full= 77.91, winner=Full
n_iter=  19, n_iter=  19, k=   2, d= 125, time_elkan=  3.17, time_full=  5.51, winner=Elkan
n_iter=  54, n_iter=  54, k=   5, d= 125, time_elkan=  9.96, time_full= 19.17, winner=Elkan
n_iter= 169, n_iter= 169, k=  10, d= 125, time_elkan= 16.45, time_full= 31.44, winner=Elkan
n_iter=  60, n_iter=  60, k=  20, d= 125, time_elkan= 21.45, time_full= 35.11, winner=Elkan
n_iter= 130, n_iter= 130, k=  50, d= 125, time_elkan= 49.37, time_full= 81.40, winner=Elkan
n_iter=  82, n_iter=  82, k= 100, d= 125, time_elkan=135.25, time_full=148.29, winner=Elkan
n_iter=  22, n_iter=  22, k=   2, d= 216, time_elkan=  5.58, time_full=  8.99, winner=Elkan
n_iter=  61, n_iter=  61, k=   5, d= 216, time_elkan= 16.80, time_full= 34.09, winner=Elkan
n_iter= 108, n_iter= 108, k=  10, d= 216, time_elkan= 22.85, time_full= 45.63, winner=Elkan
n_iter=  78, n_iter=  78, k=  20, d= 216, time_elkan= 33.35, time_full= 61.57, winner=Elkan
n_iter= 238, n_iter= 238, k=  50, d= 216, time_elkan= 57.04, time_full=113.56, winner=Elkan
n_iter= 240, n_iter= 240, k= 100, d= 216, time_elkan=140.52, time_full=207.79, winner=Elkan
n_iter=  22, n_iter=  22, k=   2, d= 343, time_elkan=  6.98, time_full= 13.22, winner=Elkan
n_iter= 116, n_iter= 116, k=   5, d= 343, time_elkan= 23.74, time_full= 52.03, winner=Elkan
n_iter=  52, n_iter=  52, k=  10, d= 343, time_elkan= 33.98, time_full= 72.24, winner=Elkan
n_iter= 137, n_iter= 137, k=  20, d= 343, time_elkan= 41.94, time_full= 84.31, winner=Elkan
n_iter= 103, n_iter= 103, k=  50, d= 343, time_elkan= 70.77, time_full=151.32, winner=Elkan
n_iter= 102, n_iter= 102, k= 100, d= 343, time_elkan=171.56, time_full=303.35, winner=Elkan
n_iter=  18, n_iter=  18, k=   2, d= 512, time_elkan=  9.29, time_full= 17.43, winner=Elkan
n_iter=  64, n_iter=  64, k=   5, d= 512, time_elkan= 24.56, time_full= 52.91, winner=Elkan
n_iter=  59, n_iter=  59, k=  10, d= 512, time_elkan= 56.67, time_full=128.65, winner=Elkan
n_iter= 140, n_iter= 140, k=  20, d= 512, time_elkan= 74.82, time_full=165.59, winner=Elkan
n_iter= 133, n_iter= 133, k=  50, d= 512, time_elkan= 98.23, time_full=232.96, winner=Elkan
n_iter=  98, n_iter=  98, k= 100, d= 512, time_elkan=242.41, time_full=512.89, winner=Elkan

which appears to align with earlier results that Elkan is better for high dimensionality, and low k (because of the many bounds). For larger k, Hamerly and its improvements are often better than Elkan.

1reaction
ogriselcommented, Dec 15, 2021

As explained here those benchmark results might be biased in favor of Lloyd because the Elkan implementation can suffer from the fact that it iteratively calls into 2 distinct threadpools (OpenBLAS native threadpool and OpenMP) when scipy is installed from the wheels.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Charles R. Twardy - On K-means Clustering - Micro.blog
Pfh found that the K-means algorithm in his R package put too many ... EM algorithm ( algorithm='full' ) is faster than the...
Read more >
sklearn.cluster.KMeans — scikit-learn 1.2.0 documentation
The "elkan" variation can be more efficient on some datasets with well-defined clusters, by using the triangle inequality. However it's more memory intensive ......
Read more >
k-means triangle inequality - Cross Validated - Stack Exchange
Details are in the original paper about acceleration: Charles Elkan. Using the triangle inequality to accelerate k-means.
Read more >
Large scale K-means clustering using GPUs | SpringerLink
Experiments show that ASB K-means can run up to 15x times faster than a standard GPU-based implementation of k-means, and it also outperforms ......
Read more >
Yinyang K-Means: A Drop-In Replacement of the Classic K ...
gorithm is one of the most widely used clustering algo- ... n, k,or d is large, the algorithm runs slow due to its...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found