KNeighborsRegressor gives different results for different n_jobs values
See original GitHub issue

Description
When using the 'seuclidean' distance metric, the algorithm produces different predictions for different values of the n_jobs parameter if no V is passed in metric_params. As a consequence, with n_jobs=-1, two machines can produce different results depending on their number of cores. The same happens with the 'mahalanobis' distance metric if no V and VI are passed in metric_params.
Steps/Code to Reproduce
# Import required packages
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
# Prepare the dataset (load_boston was removed in scikit-learn 1.2; this repro targets 0.20)
dataset = load_boston()
target = dataset.target
data = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Split the dataset
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
# Create a regressor with seuclidean distance, without passing V (n_jobs=1)
model_n_jobs_1 = KNeighborsRegressor(n_jobs=1, algorithm='brute', metric='seuclidean')
model_n_jobs_1.fit(X_train, y_train)
np.sum(model_n_jobs_1.predict(X_test)) # --> 2127.99999
# Create a regressor with seuclidean distance, without passing V (n_jobs=3)
model_n_jobs_3 = KNeighborsRegressor(n_jobs=3, algorithm='brute', metric='seuclidean')
model_n_jobs_3.fit(X_train, y_train)
np.sum(model_n_jobs_3.predict(X_test)) # --> 2129.38
# Create a regressor with seuclidean distance, without passing V (n_jobs=-1)
model_n_jobs_all = KNeighborsRegressor(n_jobs=-1, algorithm='brute', metric='seuclidean')
model_n_jobs_all.fit(X_train, y_train)
np.sum(model_n_jobs_all.predict(X_test)) # --> 2125.29999
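A minimal sketch of the workaround, assuming only that V is computed on the training data and passed explicitly via metric_params (shown here on a small synthetic dataset rather than the Boston data): with V fixed up front, predictions no longer depend on n_jobs.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Small synthetic dataset standing in for the Boston data
rng = np.random.RandomState(42)
X_train = rng.rand(100, 5)
y_train = rng.rand(100)
X_test = rng.rand(20, 5)

# Compute the variance vector over the training data explicitly,
# instead of letting cdist derive it per chunk
V = X_train.var(axis=0)

preds = []
for n_jobs in (1, 3, -1):
    model = KNeighborsRegressor(n_jobs=n_jobs, algorithm='brute',
                                metric='seuclidean',
                                metric_params={'V': V})
    model.fit(X_train, y_train)
    preds.append(model.predict(X_test))

# All predictions agree regardless of n_jobs
assert all(np.allclose(preds[0], p) for p in preds[1:])
```

The same idea applies to 'mahalanobis': compute V (or VI) once on the training data and pass it in metric_params.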
Expected Results
The prediction should be always the same and not depend on the value passed to the n_jobs parameter.
Actual Results
The prediction value changes depending on the value passed to n_jobs which, in case of n_jobs=-1, makes the prediction depend on the number of cores of the machine running the code.
Versions
System
python: 3.6.6 (default, Jun 28 2018, 04:42:43) [GCC 5.4.0 20160609]
executable: /home/mcorella/.local/share/virtualenvs/outlier_detection-8L4UL10d/bin/python3.6
machine: Linux-4.15.0-39-generic-x86_64-with-Ubuntu-16.04-xenial
BLAS
macros: NO_ATLAS_INFO=1, HAVE_CBLAS=None
lib_dirs: /usr/lib
cblas_libs: cblas
Python deps
pip: 18.1
setuptools: 40.5.0
sklearn: 0.20.0
numpy: 1.15.4
scipy: 1.1.0
Cython: None
pandas: 0.23.4
Issue Analytics
- Created 5 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
that is music to my ears 😃
Thanks for the report.
I think I’ve been able to find the source of the bug. The predict method of KNeighborsRegressor calls pairwise_distances, which calls scipy.spatial.distance.cdist because 'seuclidean' is a scipy metric. cdist computes the pairwise distances between two observation samples X and Y.
The issue is that when n_jobs != 1, Y is split into chunks and cdist is called on each (X, Y_chunk). But when V is not given, it is computed inside cdist as var(vstack([X, Y_chunk])), hence it differs for each chunk.
The fix would be to compute V in pairwise_distances on the whole (X, Y) before splitting into chunks.
It would be worth checking whether something similar can happen with other metric params. EDIT: It seems there are none apart from 'mahalanobis' with the V and VI params mentioned above.