SVC's predict_proba(...) predicts the exact same probability for wildly different inputs
I’m using scikit-learn version 0.24.1.
I suspect some form of memoization or extremely low sensitivity in the SVC predict_proba(...) implementation: once an SVC model is built from a Pipeline with input-scaling preprocessing and calibration, and with the random_state argument set, I get the exact same predicted probability for wildly different inputs.
I also checked the decision_function(...) result, and it returns the exact same value for two wildly different x_test inputs.
This issue only happens when setting the random_state argument, e.g. to zero. It is hard to create an MRE here because of the NDA and the confidentiality of the dataset and work, but I propose to export the pipeline like this:
from joblib import dump, load
dump(model, '/some/where/model.joblib')
and provide the two wildly different inputs that lead to the exact same predict_proba(...) and decision_function(...) results. Would that be a viable way to reproduce and fix the possible bug?
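On the receiving end, reproduction would then be a matter of something like the following (a sketch; the path is a placeholder, and x_test1/x_test2 are the two inputs I would provide alongside the dump):

from joblib import load

# load the serialized pipeline and replay the two problematic inputs
model = load('/some/where/model.joblib')
print(model.predict_proba(x_test1))
print(model.predict_proba(x_test2))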
A simplified relevant example of my pipeline is the following:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline(steps=[('preprocess', MaxAbsScaler()),
                           ('svm', SVC(kernel='rbf', probability=True,
                                       random_state=0, decision_function_shape='ovr',
                                       break_ties=False, ...))])
params = [{...}]
model = GridSearchCV(pipeline, params, ...).fit(x_train, y_train).best_estimator_
# and now I get
prob1 = model.predict_proba(x_test1)
prob2 = model.predict_proba(x_test2_wildly_diff_from_test1)
assert np.all(np.abs(prob1 - prob2) < 1e-10)  # passes: the outputs are identical
For example, with random_state=0:
>>> import numpy as np
>>> from scipy.spatial import distance
# how far are the two x_test vectors from each other?
>>> distance.cosine(x_test1, x_test2) # EDIT: very far apart angle-wise, i.e. just over 90°
1.0280449512858494
>>> distance.euclidean(x_test1, x_test2) # very far in euclidean distance
30675.221284568033
>>> model.predict_proba(x_test1)
array([[0.86879653, 0.13120347]])
>>> model.predict_proba(x_test2)
array([[0.86879653, 0.13120347]])
>>> model.decision_function(x_test1)
array([-0.03474242])
>>> model.decision_function(x_test2)
array([-0.03474242])
UPDATE: what bothers me is not that the probabilities are close; what spooks me is that two completely different vectors end up with the exact same decision_function(...) distance from the decision boundary (identical to within 1e-10), and therefore the exact same predict_proba(...) probability too. I’m still thinking about how to further validate and scrutinize this case …
Top GitHub Comments
If there is an issue, it’s probably related to decision_function, not predict_proba (since the proba comes from the calibration of the decision function, as previously noted). Also, we should have decision_function(x1) == decision_function(x2) iff predict_proba(x1) == predict_proba(x2).

This is quite surprising because random_state only affects the calibration, not the decision function. That would mean that random_state is affecting decision_function?
Also, this might be normal behavior: just because 2 samples are far apart in the input space doesn’t mean that they’ll be far apart in the projected space. You could try to manually compute sum_over_svs K(sv, x) for both inputs to confirm that (see the sketch below)? (EDIT: actually I’m not so sure how to confirm that. K(x1, x2) will give you the inner product of both samples in the transformed space, but that’s it.)

Side note: you seem to have a binary classification problem, so decision_function_shape='ovr' will be ignored. Same for break_ties.
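A minimal sketch of that manual check, assuming the fitted pipeline model from above is in scope. Note that svm._gamma is a private attribute of the fitted SVC, so this leans on scikit-learn internals, and the manual value should match decision_function for the binary case up to sign conventions:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

svm = model.named_steps['svm']
scaler = model.named_steps['preprocess']

for x in (x_test1, x_test2):
    x_scaled = scaler.transform(np.atleast_2d(x))
    # RBF kernel values between x and every support vector
    k = rbf_kernel(x_scaled, svm.support_vectors_, gamma=svm._gamma)
    # binary decision value = sum_i dual_coef_i * K(sv_i, x) + intercept
    manual = k @ svm.dual_coef_.ravel() + svm.intercept_
    print(manual, svm.decision_function(x_scaled))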
Could you provide the outputs of the two functions? I don’t recall how SVM is creating the probability, but it could be a corner case in which this is not that surprising.
Edit: I see that you mentioned it. If I recall, it is a kind of calibration with a sigmoid. I can imagine that the probabilities could be close when you are at the extremes of the sigmoid function. But this is just a hunch.
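For reference, a hedged sketch of that sigmoid (Platt) calibration as libsvm applies it when probability=True. probA_ and probB_ are real attributes of a fitted SVC, but the exact sign convention below is an assumption worth double-checking against the source:

import numpy as np

def platt_proba(decision_value, A, B):
    # Platt scaling: P(y=1 | f) = 1 / (1 + exp(A * f + B))
    return 1.0 / (1.0 + np.exp(A * decision_value + B))

# identical decision values necessarily map to identical probabilities,
# which is consistent with the behavior reported above
svm = model.named_steps['svm']
f1 = model.decision_function(x_test1)
f2 = model.decision_function(x_test2)
print(platt_proba(f1, svm.probA_, svm.probB_))
print(platt_proba(f2, svm.probA_, svm.probB_))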