
Poor performance of KNeighborsClassifier for sklearn version >=0.24.2

Describe the bug

Hi. I use sklearn.neighbors.KNeighborsClassifier in my program, and I noticed that its runtime increases when I adopt newer versions of sklearn.

As the experiment results below show, the model performs best with sklearn <0.24.2.

My question is: why do the older versions perform better? Are there any issues in the newer versions?

The detailed information is as follows:

Runtime (s)    Memory (MB)        Version
451.2495478    2337.8830499649    1.0.1
441.4340003    2338.7937555313    0.24.2
26.25025550    414.1812286377     0.23.2
26.42619310    408.6705312729     0.22.1
26.42619310    408.6052923203     0.22
31.0837617     409.0471181870     0.21.3
29.0283316     408.5676021576     0.20.3
22.6652508     409.7873554230     0.19.2

My program only uses the KNeighborsClassifier API provided by sklearn:

knn = KNeighborsClassifier(n_neighbors = 3)
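
For reference, a minimal sketch of that call pattern is shown below; the synthetic data, sizes, and timing wrapper are illustrative stand-ins, not part of the original program.

import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real training data (shapes chosen arbitrarily).
rng = np.random.RandomState(0)
X = rng.rand(1000, 20)
y = rng.randint(0, 2, size=1000)

knn = KNeighborsClassifier(n_neighbors=3)
start = time.perf_counter()
knn.fit(X, y)          # for KNN, fit essentially stores the training set
knn.predict(X[:200])   # the neighbor search happens at predict time
print(f"fit + predict: {time.perf_counter() - start:.3f} s")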

Steps/Code to Reproduce

I created a small program that uses KNeighborsClassifier; you can test it on Colab. The runtime is still higher with the newer sklearn versions on this example.

train.csv

Using the above train.csv and the following snippet:

# %%
import pandas as pd

# Load the training data and summarise missing values per column.
train_data = pd.read_csv("train.csv")
total = train_data.isnull().sum().sort_values(ascending=False)
percent_1 = train_data.isnull().sum() / train_data.isnull().count() * 100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=["Total", "%"])

# Correlation of the numeric columns with the target
# (newer pandas may require train_data.corr(numeric_only=True)).
cor = train_data.corr()
cor_target = abs(cor["target"])
relevant_features = cor_target[cor_target > 0.5]

# Collect the categorical (object-dtype) columns, then drop the
# high-cardinality ones before encoding.
s = train_data.dtypes == "object"
train_data_cat_var = list(s[s].index)
train_data.drop(
    ["nom_5", "nom_6", "nom_7", "nom_8", "nom_9", "ord_3", "ord_4", "ord_5"],
    axis=1,
    inplace=True,
)
train_data_cat_var = [
    ele
    for ele in train_data_cat_var
    if ele
    not in ["nom_5", "nom_6", "nom_7", "nom_8", "nom_9", "ord_3", "ord_4", "ord_5"]
]

# One-hot encode the remaining categorical columns and split features/target.
final_train_data = pd.get_dummies(
    train_data, columns=train_data_cat_var, drop_first=True
)
features = final_train_data.drop(["target"], axis=1).columns
target = final_train_data["target"]

# %%
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    final_train_data[features], target, test_size=0.2
)

# %%
from sklearn.neighbors import KNeighborsClassifier

# %%
%%time
# Time fit + predict of a brute-force 3-NN classifier on the held-out split.
knn = KNeighborsClassifier(n_neighbors=3, algorithm="brute")
knn.fit(X_train, y_train)
Y_pred_knn = knn.predict(X_test)


Expected Results

Similar runtime for the different sklearn versions or better performance for higher sklearn versions.

Actual Results

According to the experiment results above, when the sklearn version is 0.24.2 or 1.0.1, the runtime and memory usage of the model are much worse.

Versions

I tested my program with sklearn 1.0.1, 0.24.2, 0.23.2, 0.22.1, 0.22, 0.21.3, 0.20.3, and 0.19.2.

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

glemaitre commented on May 20, 2022 (3 reactions)

Can you provide a minimal example that we can copy-paste? Right now the data are located on your drive.

Could you also try the latest release, scikit-learn 1.1.1? We improved the scaling of the NN algorithm (even though the regression shown seems really weird).
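
For reference, a self-contained stand-in along those lines is sketched below; sklearn.datasets.make_classification replaces the Drive-hosted train.csv, and the dataset size is an illustrative guess. Running the same file under each installed release gives directly comparable timings.

import time

import sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data so the benchmark can be copy-pasted without any external file.
X, y = make_classification(
    n_samples=20_000, n_features=100, n_informative=20, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=3, algorithm="brute")
start = time.perf_counter()
knn.fit(X_train, y_train)
knn.predict(X_test)
print(sklearn.__version__, f"{time.perf_counter() - start:.2f} s")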

DSOTM-pf commented on May 24, 2022 (2 reactions)

Hello, thanks for your quick reply. I tested the snippet with different sklearn versions and different algorithms. The detailed information is as follows:

Python version    Sklearn version    Runtime (s)    Peak memory (MB)    Algorithm
3.8.13            1.1.1              89.47          2205.30             brute
3.8.13            1.0.2              90.31          2205.85             brute
3.7.10            1.0.2              86.89          2206.15             brute
3.8.13            1.1.1              3.86           233.28              ball_tree
3.8.13            1.0.2              4.35           232.80              ball_tree
3.7.10            1.0.2              3.56           233.05              ball_tree

I measured the peak memory with tracemalloc.get_traced_memory(). The brute algorithm seems to be what causes the runtime and memory to increase.
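
A sketch of that kind of measurement is shown below; the data shapes are invented for the example, and note that tracemalloc only sees allocations made through Python's allocator, so native scratch memory may be undercounted.

import time
import tracemalloc

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; compare the two neighbor-search algorithms.
rng = np.random.RandomState(0)
X_train = rng.rand(10_000, 50)
y_train = rng.randint(0, 2, size=10_000)
X_test = rng.rand(2_000, 50)

for algo in ("brute", "ball_tree"):
    tracemalloc.start()
    start = time.perf_counter()
    knn = KNeighborsClassifier(n_neighbors=3, algorithm=algo)
    knn.fit(X_train, y_train)
    knn.predict(X_test)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{algo}: {elapsed:.2f} s, peak {peak / 1e6:.1f} MB")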

Device info

  • OS: Ubuntu
  • CPU: Intel® Core™ i9-9900K
  • GPU: TITAN V
