knn predict unreasonably slow b/c of use of scipy.stats.mode

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(centers=2, random_state=4, n_samples=30)
knn = KNeighborsClassifier(algorithm='kd_tree').fit(X, y)

x_min, x_max = X[:, 0].min(), X[:, 0].max()
y_min, y_max = X[:, 1].min(), X[:, 1].max()

xx = np.linspace(x_min, x_max, 1000)
# change 100 to 1000 below and wait a long time
yy = np.linspace(y_min, y_max, 100)

X1, X2 = np.meshgrid(xx, yy)
X_grid = np.c_[X1.ravel(), X2.ravel()]
decision_values = knn.predict(X_grid)

This spends all its time in unique within stats.mode, not in the distance calculation, because mode runs unique once for every row. I'm pretty sure we can replace the call to mode by building a CSR matrix of per-row class counts and then taking the argmax.
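To make the CSR idea concrete, here is a minimal sketch, not the scikit-learn implementation. It assumes the neighbor labels are already encoded as integers in `0..n_classes-1` and arranged as an `(n_queries, k)` array; the helper name `vote_with_csr` is hypothetical and introduced only for illustration.

```python
import numpy as np
from scipy import sparse


def vote_with_csr(neighbor_labels, n_classes):
    """Majority vote per row using a sparse count matrix (hypothetical helper).

    neighbor_labels: integer array of shape (n_queries, k), values in [0, n_classes).
    """
    n_queries, k = neighbor_labels.shape
    # One (row, class) entry per neighbor; each entry carries a vote of 1.
    rows = np.repeat(np.arange(n_queries), k)
    cols = neighbor_labels.ravel()
    votes = np.ones(n_queries * k, dtype=np.intp)
    # Duplicate (row, col) pairs are summed by the constructor,
    # which yields per-class vote counts for every query row at once.
    counts = sparse.csr_matrix((votes, (rows, cols)), shape=(n_queries, n_classes))
    # argmax on a sparse matrix returns a matrix-like result; flatten it to 1-D.
    return np.asarray(counts.argmax(axis=1)).ravel()


# Small example with no ties: 4 query points, k=3 neighbors, 3 classes.
labels = np.array([[0, 0, 1],
                   [1, 1, 0],
                   [2, 2, 2],
                   [1, 2, 2]])
winners = vote_with_csr(labels, n_classes=3)
```

The counting happens in a single vectorized pass instead of one unique call per row. Tie-breaking may not match stats.mode exactly, so any real replacement would need to check that case.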

How much is it worth optimizing this? I feel KNN should be fast in low dimensions and people might actually use this. Having the bottleneck in the wrong place just feels wrong to me 😉

Issue Analytics

Created 4 years ago
Comments: 11 (10 by maintainers)

Top GitHub Comments

jnothman commented, Aug 2, 2019

At https://github.com/scikit-learn/scikit-learn/pull/9597#issuecomment-424379575, @TomDLT pointed out that argmax of predict_proba is faster than the current predict implementation. Any proposal here should compare to using that approach (not yet implemented there) and avoiding mode altogether.
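For reference, the predict_proba approach mentioned here can be tried from user code with the public API alone. The following is a rough sketch on a smaller grid than the one in the issue; the variable names `proba` and `pred_from_proba` are illustrative, and it assumes the default uniform weights with k=5, where two classes cannot tie.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Same toy data and estimator as in the issue.
X, y = make_blobs(centers=2, random_state=4, n_samples=30)
knn = KNeighborsClassifier(algorithm='kd_tree').fit(X, y)

# A small grid (50 x 50) so the comparison with predict stays quick.
xx = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
yy = np.linspace(X[:, 1].min(), X[:, 1].max(), 50)
X1, X2 = np.meshgrid(xx, yy)
X_grid = np.c_[X1.ravel(), X2.ravel()]

# Class probabilities per grid point, then the most probable class.
proba = knn.predict_proba(X_grid)
pred_from_proba = knn.classes_[np.argmax(proba, axis=1)]

# Sanity check against the regular predict path.
same = np.array_equal(pred_from_proba, knn.predict(X_grid))
```

Whether this is actually faster depends on the installed scikit-learn version, so it is worth timing both paths on the full 1000 x 1000 grid before relying on it.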

rth commented, Aug 26, 2020

Yes, I need to finish https://github.com/scikit-learn/scikit-learn/pull/14543 to fix it