predict values sometimes different from top predict_proba entries?
Describe the bug
I’m not actually sure if this is a bug or my own misunderstanding, but I would expect the argmax of each row of predict_proba to be exactly equal to the value in the corresponding row of predict. However, this is not what I’m seeing. Instead, the values are equal up until ~25, at which point they start to diverge by 1, then by 2 at ~50, and by 3 at ~115.
Steps/Code to Reproduce
If this is actually a bug, I can provide the actual data; for now, this is my pipeline:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

def new_text_clf():
    return Pipeline(
        [
            # StemmedCountVectorizer is a custom CountVectorizer subclass (not part of scikit-learn)
            ("vect", StemmedCountVectorizer(ngram_range=(1, 2), max_df=0.8, min_df=3)),
            ("tfidf", TfidfTransformer(use_idf=True)),
            (
                "clf",
                SGDClassifier(
                    loss="modified_huber", penalty="l2", alpha=0.001, random_state=42
                ),
            ),
        ]
    )
And I’m running:
> from sklearn.datasets import load_files
> train = load_files(train_path, encoding="utf-8", decode_error="replace", shuffle=True, random_state=42)
> test = load_files(test_path, encoding="utf-8", decode_error="replace", shuffle=True, random_state=42)
> train.target_names == test.target_names
True
> text_clf = new_text_clf().fit(train.data, train.target)
> [*test.target]
[61, 13, 11, 89, 11, 71, 118, 33, 52, 57, 16, 57, 100, 24, ...]
> [*text_clf.predict(test.data)]
[61, 16, 11, 16, 11, 89, 26, 33, 16, 57, 16, 118, 11, 26, ...]
> [max(enumerate(prob), key=lambda p: p[1])[0] for prob in text_clf.predict_proba(test.data)]
[60, 16, 11, 16, 11, 87, 25, 32, 16, 56, 16, 115, 11, 25, ...]
Expected Results
I would expect [*text_clf.predict(test.data)] to exactly equal [max(enumerate(prob), key=lambda p: p[1])[0] for prob in text_clf.predict_proba(test.data)].
Actual Results
The two lists diverge, and the gap grows as the category index increases.
Versions
System:
    python: 3.7.7 (default, Mar 10 2020, 15:43:03) [Clang 11.0.0 (clang-1100.0.33.17)]
executable: /usr/local/opt/python/bin/python3.7
   machine: Darwin-18.7.0-x86_64-i386-64bit

Python dependencies:
       pip: 20.0.2
setuptools: 46.0.0
   sklearn: 0.22.2.post1
     numpy: 1.18.3
     scipy: 1.4.1
    Cython: None
    pandas: None
matplotlib: 3.2.1
    joblib: 0.14.1

Built with OpenMP: True
Edit:
I notice len(text_clf.predict_proba(test.data)[0]) is 119, whereas len(test.target_names) is 122. This is likely related: my guess is there’s a mapping between predict_proba column indexes and target_names indexes that isn’t a simple 1:1, and it’s not being applied correctly. This is probably because 3 categories have no training or test data (I’m not sure which). But I do think this is still a bug.
I believe the problem is that this logic, which maps the argmax index back through self.classes_:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_base.py#L312
is missing from the predict_proba implementation: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_stochastic_gradient.py#L1025-L1055
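If that’s right, routing the argmax back through classes_ should reconcile the two; a minimal sketch of that workaround (np.argmax replaces the enumerate idiom above, and Pipeline exposes the final estimator’s classes_):

import numpy as np

proba = text_clf.predict_proba(test.data)
# Each argmax is a column index into classes_, not a class label; map it back.
top = text_clf.classes_[np.argmax(proba, axis=1)]
assert np.array_equal(top, text_clf.predict(test.data))

# The 3 labels absent from classes_, i.e. labels that never appear in the training data:
missing = sorted(set(range(len(train.target_names))) - set(text_clf.classes_))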
Top GitHub Comments
My bad, I probably read too fast. The kind of issue in the title, “predict values sometimes different from top predict_proba”, is one that comes up regularly, and 99% of the time it’s because of ties, and I got influenced by that. I see now that it’s documented that classes are ordered as they are in ‘self.classes_’.

Though this was surprising to me, and seemingly to you both, as from the get-go I said that the problem went away after passing through classes_, and you said that that wasn’t the expected behaviour, but rather that there might be issues due to ties or numerical instability. Have a nice day.
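For context on the ties point: when two classes have exactly equal probabilities, np.argmax silently picks the first, so naive argmax comparisons can disagree even when nothing is wrong. A tiny illustration (the probability row is made up):

import numpy as np

proba = np.array([[0.4, 0.4, 0.2]])  # classes 0 and 1 are tied
np.argmax(proba, axis=1)             # array([0]): ties resolve to the lowest index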