knn predict unreasonably slow b/c of use of scipy.stats.mode
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
X, y = make_blobs(centers=2, random_state=4, n_samples=30)
knn = KNeighborsClassifier(algorithm='kd_tree').fit(X, y)
x_min, x_max = X[:, 0].min(), X[:, 0].max()
y_min, y_max = X[:, 1].min(), X[:, 1].max()
xx = np.linspace(x_min, x_max, 1000)
# change 100 to 1000 below and wait a long time
yy = np.linspace(y_min, y_max, 100)
X1, X2 = np.meshgrid(xx, yy)
X_grid = np.c_[X1.ravel(), X2.ravel()]
decision_values = knn.predict(X_grid)
This spends all its time in unique within stats.mode, not in the distance calculation, because mode runs unique for every row.
I'm pretty sure we can replace the call to mode by building a CSR matrix of votes and then taking the argmax.
How much is it worth optimizing this? I feel KNN should be fast in low dimensions and people might actually use this. Having the bottleneck in the wrong place just feels wrong to me 😉
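The CSR-plus-argmax idea mentioned above could be sketched roughly as follows. This is only an illustration, not the scikit-learn implementation: the helper name majority_vote_csr and the neigh_labels array (one row of integer-encoded neighbor labels per query point) are assumptions made for the example. It relies on SciPy summing duplicate (row, column) entries when building a CSR matrix from coordinate triples, so each row ends up holding per-class vote counts.

```python
import numpy as np
from scipy.sparse import csr_matrix


def majority_vote_csr(neigh_labels, n_classes):
    """Return the most frequent label in each row of neigh_labels.

    Hypothetical helper: neigh_labels has shape (n_queries, k) and holds
    integer class indices in [0, n_classes).
    """
    neigh_labels = np.asarray(neigh_labels)
    n_queries, k = neigh_labels.shape
    # One vote per (query, neighbor) pair; duplicate (row, col) pairs are
    # summed during construction, giving per-class counts for each query.
    rows = np.repeat(np.arange(n_queries), k)
    cols = neigh_labels.ravel()
    data = np.ones(n_queries * k, dtype=np.intp)
    votes = csr_matrix((data, (rows, cols)), shape=(n_queries, n_classes))
    # n_classes is usually small, so a dense (n_queries, n_classes) array is
    # cheap; np.argmax picks the smallest class index on ties.
    return votes.toarray().argmax(axis=1)


rng = np.random.RandomState(0)
neigh_labels = rng.randint(0, 3, size=(1000, 5))
predicted = majority_vote_csr(neigh_labels, n_classes=3)
```

The whole vote is done in a handful of vectorized calls, so no per-row unique is needed. Whether this is actually faster than the alternatives would need to be benchmarked.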
Issue Analytics
- State:
- Created 4 years ago
- Comments:11 (10 by maintainers)

At https://github.com/scikit-learn/scikit-learn/pull/9597#issuecomment-424379575, @TomDLT pointed out that argmax of predict_proba is faster than the current predict implementation. Any proposal here should compare to using that approach (not yet implemented there) and avoiding mode altogether.

Yes, I need to finish https://github.com/scikit-learn/scikit-learn/pull/14543 to fix it.
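For reference, the predict_proba approach discussed in the comments can be tried from user code without touching scikit-learn internals. The snippet below is a minimal sketch that reuses the setup from the issue on a smaller grid. It is not the code from either pull request, and the variable name pred_from_proba is made up for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(centers=2, random_state=4, n_samples=30)
knn = KNeighborsClassifier(algorithm='kd_tree').fit(X, y)

# Smaller grid than in the issue so the comparison runs quickly.
xx = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
yy = np.linspace(X[:, 1].min(), X[:, 1].max(), 100)
X1, X2 = np.meshgrid(xx, yy)
X_grid = np.c_[X1.ravel(), X2.ravel()]

# Class probabilities are per-class neighbor fractions; the argmax over
# columns indexes into knn.classes_ to recover the predicted labels.
proba = knn.predict_proba(X_grid)
pred_from_proba = knn.classes_[np.argmax(proba, axis=1)]
```

With two classes and the default of five neighbors there are no tied votes, so these labels should match knn.predict(X_grid). Timing the two calls on the full grid is a simple way to see how much of the cost comes from the voting step.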