
SVC's predict_proba(...) predicts the exact same probability for wildly different inputs

See original GitHub issue

I’m using scikit-learn version 0.24.1.

I suspect some form of memoization, or extremely low sensitivity, in the SVC predict_proba(...) implementation: once an SVC model is built from a Pipeline with input-scaling preprocessing and calibration, and with the random_state argument set, I get the exact same predicted probability for wildly different inputs.

I also checked the decision_function(...) result, and it returns the exact same value for two wildly different x_test inputs.

This issue only happens when setting the random_state argument, e.g. to zero. It is hard to create an MRE here because of the NDA and the confidentiality of the dataset, but I propose to export the pipeline like this:

from joblib import dump, load
dump(model, '/some/where/model.joblib')

and provide the two wildly different inputs that lead to the exact same predict_proba(...) and decision_function(...) results. Would that be a viable way to reproduce and fix the possible bug?
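To illustrate that the round-trip preserves the fitted estimator bit-for-bit, here is a self-contained sketch on toy data (the dataset and pipeline parameters are stand-ins, since the real ones are under NDA):

```python
import numpy as np
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

# Toy stand-in for the confidential dataset
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = (X[:, 0] > 0.5).astype(int)

model = Pipeline([('preprocess', MaxAbsScaler()),
                  ('svm', SVC(kernel='rbf', probability=True,
                              random_state=0))]).fit(X, y)

dump(model, 'model.joblib')        # export the fitted pipeline
restored = load('model.joblib')    # reload it elsewhere

# The restored estimator reproduces the original's outputs exactly,
# so the two problematic inputs can be replayed against it.
x_new = rng.rand(2, 3)
assert np.array_equal(model.predict_proba(x_new),
                      restored.predict_proba(x_new))
```

Since joblib serializes the fitted attributes (support vectors, dual coefficients, calibration parameters) verbatim, a reviewer loading the file would see the exact same outputs as the reporter.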

A simplified relevant example of my pipeline is the following:

pipeline = Pipeline(steps=[('preprocess', MaxAbsScaler()),
                           ('svm', SVC(kernel='rbf', probability=True,
                                       random_state=0, decision_function_shape='ovr',
                                       break_ties=False, ...))])
params = [{...}]
model = GridSearchCV(pipeline, params, ...).fit(x_train, y_train).best_estimator_

# and now I get
prob1 = model.predict_proba(x_test1)
prob2 = model.predict_proba(x_test2_wildly_diff_from_test1)
assert (np.abs(prob1 - prob2) < 1e-10).all()

For example, using random_state=0:

>>> import numpy as np
>>> from scipy.spatial import distance
# how far are the two x_test vectors from each other?
>>> distance.cosine(x_test1, x_test2)  # EDIT: very far apart angle-wise, i.e. just over 90°
1.0280449512858494
>>> distance.euclidean(x_test1, x_test2) # very far in euclidean distance
30675.221284568033
>>> model.predict_proba(x_test1)
array([[0.86879653, 0.13120347]])
>>> model.predict_proba(x_test2)
array([[0.86879653, 0.13120347]])
>>> model.decision_function(x_test1)
array([-0.03474242])
>>> model.decision_function(x_test2)
array([-0.03474242])

UPDATE: what bothers me is not that the values are close; what spooks me is that two completely different vectors end up at the exact same distance (to the 1e-10 decimal place) from the decision boundary per decision_function(...), and therefore at the exact same probability per predict_proba(...). I’m still thinking about how to further validate and scrutinize this case.
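One way to scrutinize this further (a sketch on toy data, since the real model and inputs are confidential): evaluate decision_function along the straight line between the two inputs. Memoization would return a single constant everywhere, whereas a genuine model should vary along the path unless the entire segment sits in a flat region of the kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in model (the real one is under NDA)
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = (X.sum(axis=1) > 1).astype(int)
model = SVC(kernel='rbf', probability=True, random_state=0).fit(X, y)

# Hypothetical stand-ins for the two suspect inputs
x1 = np.array([0.1, 0.2])
x2 = np.array([0.9, 0.8])

# Scan decision_function along the segment between them: memoization
# would return one constant; a real model varies across the boundary.
ts = np.linspace(0.0, 1.0, 11)
line = np.vstack([(1 - t) * x1 + t * x2 for t in ts])
vals = model.decision_function(line)
print(vals)
```

If the scan comes back flat on the real data, that points at a degenerate decision surface rather than caching in `predict_proba`.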

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
NicolasHug commented, Feb 12, 2021

If there is an issue, it’s probably related to decision_function, not predict_proba (since the proba comes from the calibration of the decision function, as previously noted). Also, we should have decision_function(x1) == decision_function(x2) iff predict_proba(x1) == predict_proba(x2).

This issue only happens when setting the argument random_state e.g. to zero

This is quite surprising because random_state only affects the calibration, not the decision function. That would mean that random_state is affecting decision_function?

Also, this might be normal behavior: just because 2 samples are far apart in the input space doesn’t mean that they’ll be far apart in the projected space. You could try to manually compute sum_over_SVs K(sv, x) for both inputs to confirm that? (EDIT: actually I’m not so sure how to confirm that. K(x1, x2) will give you the inner product of both samples in the transformed space, but that’s it)
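That suggestion can be made concrete by recomputing the RBF decision value from the fitted public attributes (a sketch on toy data; `_gamma` is sklearn's private resolved-gamma attribute, and in the original pipeline the input would first have to pass through the MaxAbsScaler step):

```python
import numpy as np
from sklearn.svm import SVC

def manual_decision(svc, x):
    """Recompute sum_i dual_coef_[i] * K(sv_i, x) + intercept_ for an
    RBF-kernel SVC, returning both the decision value and the per-SV
    kernel values so each K(sv, x) can be inspected individually."""
    diffs = svc.support_vectors_ - x                        # (n_SV, n_features)
    k = np.exp(-svc._gamma * np.einsum('ij,ij->i', diffs, diffs))
    return svc.dual_coef_[0] @ k + svc.intercept_[0], k

# Toy check
rng = np.random.RandomState(0)
X = rng.rand(30, 2)
y = (X[:, 0] > 0.5).astype(int)
svc = SVC(kernel='rbf', gamma=1.0).fit(X, y)

x_far = np.array([100.0, 100.0])   # far outside the training region
d, k = manual_decision(svc, x_far)
# Every K(sv, x) underflows to ~0, so the decision value collapses to
# the intercept -- identical for ANY sufficiently distant input. This is
# one benign way two wildly different points can get the exact same value.
```

Inspecting `k` for both suspect inputs would show whether they both sit in this saturated regime of the kernel.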

Side note: you seem to have a binary classification problem so this will be ignored: decision_function_shape='ovr'. Same for break_ties.

1 reaction
glemaitre commented, Feb 12, 2021

Could you provide the output of the two functions? I don’t recall how SVM creates probabilities, but it could be a corner case that is not that surprising.

Edit: I see that you mentioned it:

I also checked the decision_function(…) result and it returns the exact same value for two wildly different x_test inputs.

If I recall, it is a kind of calibration with a sigmoid. I can imagine that the probabilities could be close when you are at the extremes of the sigmoid function. But this is just a hunch.
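The calibration in question is Platt scaling: with probability=True, libsvm fits a sigmoid p = 1 / (1 + exp(A·f + B)) mapping the decision value f to a probability, with A and B fit on internal cross-validation folds (which is where random_state enters). A quick sketch (the A and B values are illustrative, not taken from the issue):

```python
import numpy as np

def platt(f, A=-2.0, B=0.1):
    """Platt-style sigmoid calibration: maps a decision value f to a
    probability. A and B here are illustrative placeholders for the
    parameters libsvm fits during training."""
    return 1.0 / (1.0 + np.exp(A * f + B))

# The sigmoid is a deterministic function of f, so identical decision
# values necessarily yield identical probabilities:
assert platt(-0.03474242) == platt(-0.03474242)

# And in the saturated tails, even quite different f values map to
# nearly equal probabilities (both close to 1 here):
print(platt(5.0), platt(8.0))
```

So the `predict_proba` coincidence follows mechanically from the `decision_function` one; the sigmoid's flat tails would only matter if the decision values were different but large.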
