knn predict unreasonably slow b/c of use of scipy.stats.mode

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(centers=2, random_state=4, n_samples=30)
knn = KNeighborsClassifier(algorithm='kd_tree').fit(X, y)

x_min, x_max = X[:, 0].min(), X[:, 0].max()
y_min, y_max = X[:, 1].min(), X[:, 1].max()

xx = np.linspace(x_min, x_max, 1000)
# change 100 to 1000 below and wait a long time                                          
yy = np.linspace(y_min, y_max, 100)                                          

X1, X2 = np.meshgrid(xx, yy)                                                  
X_grid = np.c_[X1.ravel(), X2.ravel()]                                        
decision_values = knn.predict(X_grid)

This spends all its time in unique within stats.mode, not in the distance calculation. mode runs unique for every row. I’m pretty sure we can replace the call to mode with a call that builds a CSR matrix and then takes the argmax.
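
One way the CSR idea could look is sketched below. This is not scikit-learn's code: `rowwise_mode_csr` is a hypothetical helper, and it assumes the neighbor labels have already been encoded as integers in `0..n_classes-1`, one row per query point. It relies on the documented behaviour that duplicate (row, col) entries are summed when a CSR matrix is built from coordinates, so each row ends up holding per-class vote counts.

```python
import numpy as np
from scipy import sparse


def rowwise_mode_csr(neighbor_labels, n_classes):
    """Most frequent label in each row, computed with one sparse count matrix.

    neighbor_labels: integer array of shape (n_queries, n_neighbors),
    values assumed to lie in 0..n_classes-1 (hypothetical input layout).
    """
    n_queries, n_neighbors = neighbor_labels.shape
    # One (row, col) pair per neighbor vote: row = query index, col = label.
    rows = np.repeat(np.arange(n_queries), n_neighbors)
    cols = neighbor_labels.ravel()
    votes = np.ones(rows.shape[0], dtype=np.intp)
    # Duplicate (row, col) pairs are summed, which yields vote counts per class.
    counts = sparse.csr_matrix((votes, (rows, cols)),
                               shape=(n_queries, n_classes))
    # argmax along each row selects the class with the most votes.
    return np.asarray(counts.argmax(axis=1)).ravel()


example_labels = np.array([[0, 1, 1],
                           [2, 2, 0],
                           [1, 0, 0]])
example_modes = rowwise_mode_csr(example_labels, n_classes=3)
```

The whole batch is handled by a single matrix construction and a single argmax, so nothing runs unique once per row. Whether this is actually faster than the alternatives would need to be benchmarked.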

How much is it worth optimizing this? I feel KNN should be fast in low dimensions and people might actually use this. Having the bottleneck in the wrong place just feels wrong to me 😉

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments:11 (10 by maintainers)

Top GitHub Comments

jnothman commented, Aug 2, 2019 (2 reactions)

At https://github.com/scikit-learn/scikit-learn/pull/9597#issuecomment-424379575, @TomDLT pointed out that argmax of predict_proba is faster than the current predict implementation. Any proposal here should compare to using that approach (not yet implemented there) and avoiding mode altogether.
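
For reference, the predict_proba approach can be tried from user code roughly as follows. This is a minimal sketch for the single-output case with the default uniform weights, reusing the data from the issue; `X_query` and `pred_via_proba` are names made up for the example, and it is not the implementation proposed in the linked pull request.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(centers=2, random_state=4, n_samples=30)
knn = KNeighborsClassifier(algorithm='kd_tree').fit(X, y)

# Random query points inside the bounding box of the training data.
rng = np.random.RandomState(0)
X_query = rng.uniform(X.min(axis=0), X.max(axis=0), size=(1000, 2))

# predict_proba returns one column per entry of knn.classes_;
# the column with the highest probability is the majority vote.
proba = knn.predict_proba(X_query)
pred_via_proba = knn.classes_[np.argmax(proba, axis=1)]
```

With two classes and the default of five neighbors there can be no tied votes, so the result should match `knn.predict(X_query)`. With ties, the two code paths may need a closer look at their tie-breaking before being treated as interchangeable.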

rth commented, Aug 26, 2020 (1 reaction)
