KNeighborsRegressor gives different results for different n_jobs values
See original GitHub issue

Description
When using the 'seuclidean' distance metric, the algorithm produces different predictions for different values of the n_jobs parameter if no V is passed in metric_params. As a consequence, with n_jobs=-1, two machines can produce different results depending on their number of cores. The same happens with the 'mahalanobis' distance metric if no V and VI are passed in metric_params.
Steps/Code to Reproduce
# Import required packages
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
# Prepare the dataset (load_boston was removed in scikit-learn 1.2; this repro targets 0.20)
dataset = load_boston()
target = dataset.target
data = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Split the dataset
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
# Create a regressor with seuclidean distance, without passing V (n_jobs=1)
model_n_jobs_1 = KNeighborsRegressor(n_jobs=1, algorithm='brute', metric='seuclidean')
model_n_jobs_1.fit(X_train, y_train)
np.sum(model_n_jobs_1.predict(X_test)) # --> 2127.99999
# Create a regressor with seuclidean distance, without passing V (n_jobs=3)
model_n_jobs_3 = KNeighborsRegressor(n_jobs=3, algorithm='brute', metric='seuclidean')
model_n_jobs_3.fit(X_train, y_train)
np.sum(model_n_jobs_3.predict(X_test)) # --> 2129.38
# Create a regressor with seuclidean distance, without passing V (n_jobs=-1)
model_n_jobs_all = KNeighborsRegressor(n_jobs=-1, algorithm='brute', metric='seuclidean')
model_n_jobs_all.fit(X_train, y_train)
np.sum(model_n_jobs_all.predict(X_test)) # --> 2125.29999
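A minimal sketch of the workaround, assuming only that V is computed on the training data and passed explicitly via metric_params (shown here on a small synthetic dataset rather than the Boston data): with V fixed up front, predictions no longer depend on n_jobs.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Small synthetic dataset standing in for the Boston data
rng = np.random.RandomState(42)
X_train = rng.rand(100, 5)
y_train = rng.rand(100)
X_test = rng.rand(20, 5)

# Compute the variance vector over the training data explicitly,
# instead of letting cdist derive it per chunk
V = X_train.var(axis=0)

preds = []
for n_jobs in (1, 3, -1):
    model = KNeighborsRegressor(n_jobs=n_jobs, algorithm='brute',
                                metric='seuclidean',
                                metric_params={'V': V})
    model.fit(X_train, y_train)
    preds.append(model.predict(X_test))

# All predictions agree regardless of n_jobs
assert all(np.allclose(preds[0], p) for p in preds[1:])
```

The same idea applies to 'mahalanobis': compute V (or VI) once on the training data and pass it in metric_params.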
Expected Results
The prediction should be always the same and not depend on the value passed to the n_jobs parameter.
Actual Results
The prediction value changes depending on the value passed to n_jobs which, in case of n_jobs=-1, makes the prediction depend on the number of cores of the machine running the code.
Versions
System
python: 3.6.6 (default, Jun 28 2018, 04:42:43) [GCC 5.4.0 20160609]
executable: /home/mcorella/.local/share/virtualenvs/outlier_detection-8L4UL10d/bin/python3.6
machine: Linux-4.15.0-39-generic-x86_64-with-Ubuntu-16.04-xenial
BLAS
macros: NO_ATLAS_INFO=1, HAVE_CBLAS=None
lib_dirs: /usr/lib
cblas_libs: cblas
Python deps
pip: 18.1
setuptools: 40.5.0
sklearn: 0.20.0
numpy: 1.15.4
scipy: 1.1.0
Cython: None
pandas: 0.23.4
Issue Analytics
- Created 5 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
that is music to my ears 😃
Thanks for the report.
I think I’ve been able to find the source of the bug. The predict method of KNeighborsRegressor calls pairwise_distances, which calls scipy.spatial.distance.cdist because 'seuclidean' is a scipy metric. cdist computes the pairwise distances between two observation samples X and Y.
The issue is that when n_jobs != 1, Y is split into chunks and cdist is called on each (X, Y_chunk). But when V is not given, it is computed inside cdist as var(vstack([X, Y_chunk])), hence it differs for each chunk.
The fix would be to compute V in pairwise_distances on the whole (X, Y) before splitting into chunks.
It would be worth checking whether something similar can happen with other metric params. EDIT: It seems there are none apart from 'mahalanobis' with the V and VI params mentioned above.