
Poor performance of KNeighborsClassifier for sklearn version >=0.24.2

Describe the bug

Hi. I use sklearn.neighbors.KNeighborsClassifier in my program, and I noticed that its runtime increases when I adopt newer versions of sklearn.

As the experiment results below show, the model performs best with sklearn <0.24.2.

My question is: why do the older versions perform better? Are there any issues in the newer versions?

The detailed information is as follows:

Runtime (s)    Memory (MB)        Version
451.2495478    2337.8830499649    1.0.1
441.4340003    2338.7937555313    0.24.2
26.25025550    414.1812286377     0.23.2
26.42619310    408.6705312729     0.22.1
26.42619310    408.6052923203     0.22
31.0837617     409.0471181870     0.21.3
29.0283316     408.5676021576     0.20.3
22.6652508     409.7873554230     0.19.2

My program only uses the KNeighborsClassifier API provided by sklearn:

knn = KNeighborsClassifier(n_neighbors = 3)
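
For reference, a minimal sketch of that call pattern is shown below; the synthetic data, sizes, and timing wrapper are illustrative stand-ins, not part of the original program.

import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real training data (shapes chosen arbitrarily).
rng = np.random.RandomState(0)
X = rng.rand(1000, 20)
y = rng.randint(0, 2, size=1000)

knn = KNeighborsClassifier(n_neighbors=3)
start = time.perf_counter()
knn.fit(X, y)          # for KNN, fit essentially stores the training set
knn.predict(X[:200])   # the neighbor search happens at predict time
print(f"fit + predict: {time.perf_counter() - start:.3f} s")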

Steps/Code to Reproduce

I created a small program that uses KNeighborsClassifier; you can test it on Colab. The runtime is still higher with the newer sklearn versions on this example.

train.csv

Using the above train.csv and the following snippet:

# %%
import pandas as pd

# Load the training data and summarise missing values per column.
train_data = pd.read_csv("train.csv")
total = train_data.isnull().sum().sort_values(ascending=False)
percent_1 = train_data.isnull().sum() / train_data.isnull().count() * 100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=["Total", "%"])

# Correlation of the numeric columns with the target
# (newer pandas may require train_data.corr(numeric_only=True)).
cor = train_data.corr()
cor_target = abs(cor["target"])
relevant_features = cor_target[cor_target > 0.5]

# Collect the categorical (object-dtype) columns, then drop the
# high-cardinality ones before encoding.
s = train_data.dtypes == "object"
train_data_cat_var = list(s[s].index)
train_data.drop(
    ["nom_5", "nom_6", "nom_7", "nom_8", "nom_9", "ord_3", "ord_4", "ord_5"],
    axis=1,
    inplace=True,
)
train_data_cat_var = [
    ele
    for ele in train_data_cat_var
    if ele
    not in ["nom_5", "nom_6", "nom_7", "nom_8", "nom_9", "ord_3", "ord_4", "ord_5"]
]

# One-hot encode the remaining categorical columns and split features/target.
final_train_data = pd.get_dummies(
    train_data, columns=train_data_cat_var, drop_first=True
)
features = final_train_data.drop(["target"], axis=1).columns
target = final_train_data["target"]

# %%
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    final_train_data[features], target, test_size=0.2
)

# %%
from sklearn.neighbors import KNeighborsClassifier

# %%
%%time
# Time fit + predict of a brute-force 3-NN classifier on the held-out split.
knn = KNeighborsClassifier(n_neighbors=3, algorithm="brute")
knn.fit(X_train, y_train)
Y_pred_knn = knn.predict(X_test)


Expected Results

Similar runtime for the different sklearn versions or better performance for higher sklearn versions.

Actual Results

According to the experiment results above, when the sklearn version is 0.24.2 or 1.0.1, the runtime and memory usage of the model are much worse.

Versions

I tested my program with sklearn 1.0.1, 0.24.2, 0.23.2, 0.22.1, 0.22, 0.21.3, 0.20.3, and 0.19.2.

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

glemaitre commented on May 20, 2022 (3 reactions)

Can you provide a minimal example that we can copy-paste? Right now the data are located on your drive.

Could you also try the latest release, scikit-learn 1.1.1? We improved the scaling of the NN algorithm (even though the regression shown seems really weird).
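
For reference, a self-contained stand-in along those lines is sketched below; sklearn.datasets.make_classification replaces the Drive-hosted train.csv, and the dataset size is an illustrative guess. Running the same file under each installed release gives directly comparable timings.

import time

import sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data so the benchmark can be copy-pasted without any external file.
X, y = make_classification(
    n_samples=20_000, n_features=100, n_informative=20, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=3, algorithm="brute")
start = time.perf_counter()
knn.fit(X_train, y_train)
knn.predict(X_test)
print(sklearn.__version__, f"{time.perf_counter() - start:.2f} s")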

DSOTM-pf commented on May 24, 2022 (2 reactions)

Hello, thanks for your quick reply. I tested the snippet with different sklearn versions and different algorithms. The detailed information is as follows:

Python version    Sklearn version    Runtime (s)    Peak memory (MB)    Algorithm
3.8.13            1.1.1              89.47          2205.30             brute
3.8.13            1.0.2              90.31          2205.85             brute
3.7.10            1.0.2              86.89          2206.15             brute
3.8.13            1.1.1              3.86           233.28              ball_tree
3.8.13            1.0.2              4.35           232.80              ball_tree
3.7.10            1.0.2              3.56           233.05              ball_tree

I measured the peak memory with tracemalloc.get_traced_memory(). The brute algorithm seems to be what causes the runtime and memory to increase.
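
A sketch of that kind of measurement is shown below; the data shapes are invented for the example, and note that tracemalloc only sees allocations made through Python's allocator, so native scratch memory may be undercounted.

import time
import tracemalloc

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; compare the two neighbor-search algorithms.
rng = np.random.RandomState(0)
X_train = rng.rand(10_000, 50)
y_train = rng.randint(0, 2, size=10_000)
X_test = rng.rand(2_000, 50)

for algo in ("brute", "ball_tree"):
    tracemalloc.start()
    start = time.perf_counter()
    knn = KNeighborsClassifier(n_neighbors=3, algorithm=algo)
    knn.fit(X_train, y_train)
    knn.predict(X_test)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{algo}: {elapsed:.2f} s, peak {peak / 1e6:.1f} MB")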

Device info

  • OS: Ubuntu
  • CPU: Intel® Core™ i9-9900K
  • GPU: TITAN V
