SVC's predict_proba(...) predicts the exact same probability for wildly different inputs
I’m using scikit-learn version 0.24.1.
I suspect some form of memoization or extremely low sensitivity in the SVC predict_proba(...) implementation: once an SVC model is built from a Pipeline with input-scaling preprocessing and calibration, and with the random_state argument set, I get the exact same predicted probability for wildly different inputs.
I also checked the decision_function(...) result, and it returns the exact same value for two wildly different x_test inputs.
This issue only happens when setting the random_state argument, e.g. to zero. It is hard to create an MRE here because of the NDA and the confidentiality of the dataset and work, but I propose to export the pipeline like this:
from joblib import dump, load
dump(model, '/some/where/model.joblib')
and provide the two wildly different inputs that lead to the exact same predict_proba(...) and decision_function(...) results. Would that be a viable way to reproduce and fix the possible bug?
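On the receiving end, reproduction would then be a matter of something like the following (a sketch; the path is a placeholder, and x_test1/x_test2 are the two inputs I would provide alongside the dump):

from joblib import load

# load the serialized pipeline and replay the two problematic inputs
model = load('/some/where/model.joblib')
print(model.predict_proba(x_test1))
print(model.predict_proba(x_test2))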
A simplified relevant example of my pipeline is the following:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline(steps=[('preprocess', MaxAbsScaler()),
                           ('svm', SVC(kernel='rbf', probability=True,
                                       random_state=0, decision_function_shape='ovr',
                                       break_ties=False, ...))])
params = [{...}]
model = GridSearchCV(pipeline, params, ...).fit(x_train, y_train).best_estimator_
# and now I get
prob1 = model.predict_proba(x_test1)
prob2 = model.predict_proba(x_test2_wildly_diff_from_test1)
assert np.all(np.abs(prob1 - prob2) < 1e-10)  # passes: the outputs are identical
For example, with random_state=0:
>>> import numpy as np
>>> from scipy.spatial import distance
# how far are the two x_test vectors from each other?
>>> distance.cosine(x_test1, x_test2) # EDIT: very far apart angle-wise, i.e. just over 90°
1.0280449512858494
>>> distance.euclidean(x_test1, x_test2) # very far in euclidean distance
30675.221284568033
>>> model.predict_proba(x_test1)
array([[0.86879653, 0.13120347]])
>>> model.predict_proba(x_test2)
array([[0.86879653, 0.13120347]])
>>> model.decision_function(x_test1)
array([-0.03474242])
>>> model.decision_function(x_test2)
array([-0.03474242])
UPDATE: what bothers me is not that the probabilities are close; what spooks me is that two completely different vectors end up with the exact same decision_function(...) distance from the decision boundary (identical to within 1e-10), and therefore the exact same predict_proba(...) probability too. I’m still thinking about how to further validate and scrutinize this case …
Top GitHub Comments
If there is an issue, it’s probably related to decision_function, not predict_proba (since the proba comes from the calibration of the decision function, as previously noted). Also, we should have decision_function(x1) == decision_function(x2) iff predict_proba(x1) == predict_proba(x2).

This is quite surprising because random_state only affects the calibration, not the decision function. That would mean that random_state is affecting decision_function?
Also, this might be normal behavior: just because 2 samples are far apart in the input space doesn’t mean that they’ll be far apart in the projected space. You could try to manually compute sum_over_svs K(sv, x) for both inputs to confirm that (see the sketch below)? (EDIT: actually I’m not so sure how to confirm that. K(x1, x2) will give you the inner product of both samples in the transformed space, but that’s it.)

Side note: you seem to have a binary classification problem, so decision_function_shape='ovr' will be ignored. Same for break_ties.
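A minimal sketch of that manual check, assuming the fitted pipeline model from above is in scope. Note that svm._gamma is a private attribute of the fitted SVC, so this leans on scikit-learn internals, and the manual value should match decision_function for the binary case up to sign conventions:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

svm = model.named_steps['svm']
scaler = model.named_steps['preprocess']

for x in (x_test1, x_test2):
    x_scaled = scaler.transform(np.atleast_2d(x))
    # RBF kernel values between x and every support vector
    k = rbf_kernel(x_scaled, svm.support_vectors_, gamma=svm._gamma)
    # binary decision value = sum_i dual_coef_i * K(sv_i, x) + intercept
    manual = k @ svm.dual_coef_.ravel() + svm.intercept_
    print(manual, svm.decision_function(x_scaled))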
Could you provide the outputs of the two functions? I don’t recall how SVM is creating the probability, but it could be a corner case in which this is not that surprising.
Edit: I see that you mentioned it. If I recall, it is a kind of calibration with a sigmoid. I can imagine that the probabilities could be close when you are at the extremes of the sigmoid function. But this is just a hunch.
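For reference, a hedged sketch of that sigmoid (Platt) calibration as libsvm applies it when probability=True. probA_ and probB_ are real attributes of a fitted SVC, but the exact sign convention below is an assumption worth double-checking against the source:

import numpy as np

def platt_proba(decision_value, A, B):
    # Platt scaling: P(y=1 | f) = 1 / (1 + exp(A * f + B))
    return 1.0 / (1.0 + np.exp(A * decision_value + B))

# identical decision values necessarily map to identical probabilities,
# which is consistent with the behavior reported above
svm = model.named_steps['svm']
f1 = model.decision_function(x_test1)
f2 = model.decision_function(x_test2)
print(platt_proba(f1, svm.probA_, svm.probB_))
print(platt_proba(f2, svm.probA_, svm.probB_))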